Posts

Just How Hard a Problem is Alignment? 2023-02-25T09:00:06.066Z
Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning 2023-02-21T09:05:43.010Z

Comments

Comment by Roger Dearnaley on Is Infra-Bayesianism Applicable to Value Learning? · 2023-05-15T00:10:15.741Z · LW · GW

I take your point that the way an Infra-Bayesian system makes decisions isn't the same as the way a human does — it presumably doesn't share our cognitive biases, and its pessimism element 'Murphy' seems stronger than in most humans. I normally assume that if there's something I don't understand about the environment that's injecting noise into the outcomes of my actions, then the noise-related parts of those outcomes aren't going to be well-optimized, so they'll be worse than I could have achieved with full understanding; but I also assume that, even leaving things to chance, I may sometimes get some good luck along with the bad. I don't generally assume that everything I can't control will have literally the worst possible outcome. So I guess in Infra-Bayesian terms I'm assuming that Murphy is somewhat constrained by laws that I'm not yet aware of, and may never be aware of.

My take on Murphy is that it's a systematization of the force of entropy trying to revert the environment to a thermodynamic equilibrium state, and of the common fact that the utility of that equilibrium state is usually pretty low. One of the flaws I see in Infra-Bayesianism is that there are sometimes (hard-to-reach but physically possible) states whose utility to me is even lower than the thermodynamic equilibrium, where increasing entropy would actually help improve things: a policy that scores less than 20% on a 5-option multiple-choice quiz, and so does worse than random guessing, or a minefield left over after a war, which is actually worse than a blasted wasteland. In a hellworld, randomly throwing monkey wrenches in the gears is a moderately effective strategy. In those unusual cases Infra-Bayesianism's Murphy no longer aligns with the actual effects of entropy/Knightian uncertainty.

Comment by Roger Dearnaley on Prospect Theory: A Framework for Understanding Cognitive Biases · 2023-05-14T20:47:03.989Z · LW · GW

The observation that gains saturate has a fairly simple explanation from evolutionary theory: the increased evolutionary fitness advantage from large material gains saturates (especially in a hunter-gatherer environment). Successfully hunting a rabbit will keep me from starving for a day; but if I successfully hunt a mammoth, I can't keep the meat for long enough for it to feed me for years. The best I can do is feed everyone in the village for a few days, hoping they remember this later when my hunting is less successful, and do a bunch of extra work to make some jerky with the rest. The evolutionary advantage is sub-linear in the kilograms of raw meat. In more recent agricultural societies, rich and powerful men like Ramses II who had O(100) children needed a lot more than 50 times the average resources of men in their society to achieve that outcome (and of course that evolutionary strategy isn't possible for women). Even today, if I'm unlucky enough to get pancreatic cancer it doesn't matter how rich I am: all that money isn't going to save me, even if I'm as rich as Steve Jobs.

Similarly, on the downside, from a personal evolutionary fitness point of view, saturation also makes sense, since there is a limit to how bad things can get: once I, my family, and everyone else in the tribe related to me are all dead, it's game over, and my personal evolutionary fitness doesn't really care whether everyone else in the region also died, or not.
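
(For reference, the value function in that first diagram is conventionally parameterized along the lines of Tversky and Kahneman's 1992 form: concave and saturating for gains, convex and a bit more than twice as steep for losses.)

$$v(x) = \begin{cases} x^{\alpha} & x \ge 0 \\ -\lambda\,(-x)^{\beta} & x < 0 \end{cases} \qquad \alpha \approx \beta \approx 0.88,\quad \lambda \approx 2.25$$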

So it seems to me that at least the first diagram above of prospect theory may be an example of humans being aligned with evolution's utility function.

I don't have a good evolutionary explanation for the second diagram, unless it's a mechanism to compensate for some psychological or statistical bias in how hunter-gatherers obtain information about and estimate risks, and/or how that compares to modern mathematical risk measures like probabilities and percentages.

Comment by Roger Dearnaley on Is Infra-Bayesianism Applicable to Value Learning? · 2023-05-14T11:02:53.925Z · LW · GW

We want our value-learner AI to learn to have the same preference order over outcomes as humans, which requires its goal to be to find (or at least to learn to act according to) a utility function as close as possible to some aggregate of ours (insofar as humans actually have utility functions rather than a collection of cognitive biases), up to an arbitrary monotonically-increasing mapping. We also want its preference order over probability distributions of outcomes to match ours, which requires it to find a utility function that matches ours up to an increasing affine transformation (i.e. scale and shift). So, once it has made good progress on its value learning, its utility function ought to make a lot of sense to us.
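
A minimal sketch of why lotteries force the stronger requirement (the outcomes and numbers below are invented purely for illustration):

```python
import math

# Hypothetical utilities for three outcomes (numbers purely illustrative).
U = {"A": 0.0, "B": 1.0, "C": 3.0}

def prefers_lottery(u):
    """Is the 50/50 lottery over A and C preferred to the sure outcome B?"""
    return 0.5 * u["A"] + 0.5 * u["C"] > u["B"]

# A positive affine transform (scale and shift)...
affine = {k: 3.0 * v + 7.0 for k, v in U.items()}
# ...versus a merely monotone (here, saturating) transform.
monotone = {k: math.sqrt(v) for k, v in U.items()}

# Both transforms preserve the ordering A < B < C over pure outcomes, but only
# the affine one preserves the preference between the lottery and outcome B:
print(prefers_lottery(U))         # True:  0.5*0 + 0.5*3  = 1.5  > 1
print(prefers_lottery(affine))    # True:  0.5*7 + 0.5*16 = 11.5 > 10
print(prefers_lottery(monotone))  # False: 0.5*0 + 0.5*1.73 ≈ 0.87 < 1
```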

Comment by Roger Dearnaley on Infra-Bayesianism naturally leads to the monotonicity principle, and I think this is a problem · 2023-05-12T09:50:08.611Z · LW · GW

The sequence on Infra-Bayesianism motivates the min (a.k.a. Murphy) part of its argmax min by wanting to establish lower bounds on utility — that's a valid viewpoint. My own interest in Infra-Bayesianism comes from a different motivation: Murphy's min encodes directly into Infra-Bayesian decision-making the generally true, inter-related facts that 1) for an optimizer, uncertainty about the true world model injects noise into your optimization process, which almost always makes the outcome worse; 2) the optimizer's curse usually results in you exploring outcomes whose true utility you had overestimated, so your regret is generally higher than you had expected; and 3) most everyday environments and situations are already highly optimized, so random perturbations of their state almost invariably make things worse. All of which justifies pessimism and conservatism.

The problem with this argument is that it's only true when the utility of the current state is higher than the utility of the maximum-entropy equilibrium state of the environment (the state that increasingly randomizing it tends to move it towards, due to the law of large numbers). In everyday situations this is almost always true -- making random changes to a human body or in a city will almost invariably make things worse, for example. In most physical environments, randomizing them sufficiently (e.g. by raining meteorites on them, or whatever) will tend to reduce their utility to that of a blasted wasteland (the surface of the moon, for example, has pretty much reached equilibrium under randomization-by-meteorites, and has a very low utility). However, it's a general feature of human utility functions that there can often be states worse than the maximum-entropy equilibrium. If your environment is a 5-choice multiple-choice test whose utility is the score, the entropic equilibrium is random guessing, which will score 20%, and there are choose-wrong-answer policies that score less than that, all the way down to 0% -- and partially randomizing away from one of those policies will make its utility increase towards 20%. Similarly, consider a field of anti-personnel mines left over from a war: as an array of death and dismemberment waiting to happen, randomizing it with meteorite impacts will clear some mines and make its utility better, since it starts off actually worse than a blasted wasteland. Or, if a very smart GAI working on alignment research informed you that it had devised an acausal attack that would convert all permanent hellworlds anywhere in the multiverse into blasted wastelands, your initial assumption would probably be that doing that would be a good thing (modulo questions about whether it was pulling your leg, or whether the inhabitants of the hellworld would consider it a hellworld).

In general, hellworlds (or at least local hell-landscapes) like this are rare — they have a negligible probability of arising by chance, so creating one requires work by an unaligned optimizer. So currently, with humans the strongest optimizers on the planet, they usually only arise in adversarial situations such as wars between groups of humans ("War is Hell", as the saying goes). However, Infra-Bayesianism has exactly the wrong intuitions about any hell-environment whose utility is currently lower than that of the entropic equilibrium. If you have an environment that has been carefully optimized by a powerful, very-non-aligned optimizer so as to maximize human suffering, then random sabotage such as throwing monkey wrenches in the works or assassinating the diabolical mastermind is actually very likely to improve things (from a human point of view), at least somewhat. Infra-Bayesianism would predict otherwise. I think basing your GAI's decision theory on a system that gives exactly the wrong intuitions about hellworlds is likely to be extremely dangerous.

The solution to this would be what one might call Meso-Bayesianism -- renormalize your utility scores so that the utility of the maximal-entropy state of the environment is by definition zero, and then assume that Murphy minimizes the absolute value of the utility towards the equilibrium utility of zero, not towards a hellworld. (I'm not enough of a pure mathematician to have any idea what this modification does to the network of proofs, other than making the utility renormalization part of a Meso-Bayesian update more complicated.) So then your decision theory understands that any unaligned optimizer trying to create a hellworld is also fighting Murphy, and when fighting them on their home turf Murphy is your ally, since "it's easier to destroy than create" is also true of hellworlds. [Despite the usual formulation of Murphy's law, I actually think the name 'Murphy' suits this particular metaphysical force better -- Infra-Bayesianism's original 'Murphy' might have been better named 'Satan', since it is willing to go to any length to create a hellworld, hobbled only by some initially-unknown physical laws.]
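
Here's a toy sketch of how I'd render the difference in code. The two candidate world models, the utilities, and the equilibrium value are all invented purely for illustration, and meso_value is just one possible reading of "Murphy drags outcomes toward the equilibrium":

```python
# Toy sketch (all numbers and both candidate world models invented purely for
# illustration): we are stuck in a "hellworld" whose utilities have already
# been renormalized so the maximum-entropy equilibrium (blasted wasteland)
# sits at 0, and we have Knightian uncertainty about what sabotage would do.
utilities = {
    # action:        {candidate world model: renormalized utility afterwards}
    "do_nothing": {"sabotage_would_work": -10.0, "sabotage_would_backfire": -9.0},
    "sabotage":   {"sabotage_would_work": -3.0,  "sabotage_would_backfire": -12.0},
}

def infra_value(action):
    """Standard Infra-Bayesian pessimism: Murphy picks the worst-case model."""
    return min(utilities[action].values())

def meso_value(action):
    """One reading of the proposal: Murphy drags outcomes toward the
    equilibrium at 0, i.e. picks the candidate utility closest to zero."""
    return min(utilities[action].values(), key=abs)

print("Infra-Bayesian choice:", max(utilities, key=infra_value))  # do_nothing
print("Meso-Bayesian choice: ", max(utilities, key=meso_value))   # sabotage
```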

Comment by Roger Dearnaley on Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning · 2023-05-11T08:26:57.579Z · LW · GW

Having since started learning about Infra-Bayesianism, my initial impression is that it's a formal and well-structured mathematical framework for exactly the kind of mechanism I sketched above for breaking the Optimizer's Curse: taking the most pessimistic view of the utility over the Knightian uncertainty in your hypothesis set.

Comment by Roger Dearnaley on Finding Neurons in a Haystack: Case Studies with Sparse Probing · 2023-05-10T06:40:30.443Z · LW · GW

Regarding my second paragraph above: it turns out, from a blog post out today, that this is not only feasible, but that OpenAI have experimented with doing it, and have now open-sourced the technology for doing it:

https://openai.com/research/language-models-can-explain-neurons-in-language-models

OpenAI were only looking at explaining single neurons, so combining their approach with the original paper's sparse probing technique for superpositions seems like the obvious next step.

Comment by Roger Dearnaley on [Yann Lecun] A Path Towards Autonomous Machine Intelligence · 2023-04-25T05:18:14.785Z · LW · GW

My impression is that he's trying to do GOFAI with fully differentiable neural networks. I'm also not sure he's describing a GAI — I think he's starting by aiming for parity with the capabilities of a typical mammal, not human-level, and that's why he uses self-driving cars as an example.

Personally I think a move towards GOFAI-like ideas is a good intuition, but that insisting on keeping things fully differentiable is too constraining. I believe that at some level, we are going to need to move away from doing everything with gradient descent, and use something more like approximate Bayesianism, or at least RL.

I also think he's underestimating the influence of genetics on mammalian mental capabilities. He talks about the step of babies learning that the world is 3D not 2D — I think it's very plausible that adaptations for processing sensory data from a 3D rather than 2D world are already encoded in our genome, brain structure, and physiology in many places.

If this is going to be a GAI architecture, then I think he's massively underthinking alignment.

Comment by Roger Dearnaley on World-Model Interpretability Is All We Need · 2023-04-25T03:09:07.005Z · LW · GW

I'm not very scared of any AGI that isn't capable of being a scientist — it seems unlikely to be able to go FOOM. In order to do science, it needs to:

  1. have multiple world models at the same time that disagree, and reason under uncertainty across them
  2. do approximate Bayesian updates on their probability
  3. plan conservatively under uncertainty, i.e. have broken the Optimizer's Curse
  4. creatively come up with new hypotheses, i.e. create new candidate world models
  5. devise and carry out low-cost/risk experiments to distinguish between world models
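
As a minimal sketch of what the first three items might look like mechanically (every world model, likelihood, and utility below is invented purely for illustration):

```python
# Toy sketch of the first three items (all world models, likelihoods, and
# utilities below are invented purely for illustration).
world_models = {
    # model: (prior probability, P(observed evidence | model),
    #         predicted utility of taking some candidate action)
    "model_A": (0.5, 0.80, +10.0),
    "model_B": (0.3, 0.40, +2.0),
    "model_C": (0.2, 0.05, -50.0),
}

def bayes_update(models):
    """Approximate Bayesian update: posterior proportional to prior x likelihood."""
    z = sum(prior * lik for prior, lik, _ in models.values())
    return {m: (prior * lik / z, lik, u) for m, (prior, lik, u) in models.items()}

def conservative_value(models, credibility_floor=0.01):
    """Plan pessimistically: score the action by its worst outcome under any
    world model that hasn't been effectively ruled out."""
    return min(u for p, _, u in models.values() if p >= credibility_floor)

posterior = bayes_update(world_models)
print({m: round(p, 3) for m, (p, _, _) in posterior.items()})
print("conservative score:", conservative_value(posterior))  # -50: model_C still credible
```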

I think it's going to be hard to do all of these things well if its world models aren't fairly modular and separable from the rest of its mental architecture.

One possibility that I find plausible as a path to AGI is if we design something like a Language Model Cognitive Architecture (LMCA) along the lines of AutoGPT, and require that its world model actually be some explicit combination of human natural language, mathematical equations, and executable code that might be fairly interpretable to humans. Then the only portions of its world model that are very hard to inspect are those embedded in the LLM component.

Comment by Roger Dearnaley on A Correspondence Theorem · 2023-04-25T01:45:07.824Z · LW · GW

flowers are “real”

I really wish you'd said 'birds' here :-)

Comment by Roger Dearnaley on Stability AI releases StableLM, an open-source ChatGPT counterpart · 2023-04-24T08:30:00.798Z · LW · GW

Up to a certain size, LLMs are going to be a commodity, and academic/amateur/open-source versions will be available. Currently that scale is around 7B-20B parameters, and it will likely soon increase. However, GPT-4 supposedly cost >$100m to create, of which I've seen estimates that the raw foundation-model training cost was O($40m) [which admittedly will decrease with Moore's Law for GPUs], and there is also a significant cost for filtering the training data, doing instruct-following and safety RLHF, and so forth. It's not clear to me why any organization able to get its hands on that much money and the expertise necessary to spend it would open-source the result, at least while the result remains near-cutting-edge (leaks are of course possible, as Meta already managed to prove). So I suspect models as capable as GPT-4 will not get open-sourced any time soon (where things are moving fast enough that 'soon' means 'this year, or maybe next'). But there are quite a lot of companies/governments that can devote >$100m to an important problem, so the current situation, where there are only 3-or-4 companies with LLMs with capabilities comparable to GPT-4, isn't likely to last very long.

As for alignment, it's very unlikely that all sources of commodity LLMs will do an excellent job of persuading them not to tell you how to hotwire cars, not to roleplay AI waifus, or not to simulate 4Chan, and some will just release a foundation model with no such training. So we can expect 'unaligned' open-source LLMs. However, none of those are remotely close to civilization-ending problems. The question then is whether Language Model Cognitive Architectures (LMCAs) along the lines of AutoGPT can be made to run effectively on a suitably fine-tuned LLM of less-than-GPT-4 complexity and can still increase their capabilities to an AGI level, or whether that (if possible at all) requires an LLM of GPT-4 scale or larger. AutoGPT isn't currently that capable when run with GPT-3.5, but generally, if a foundation LLM shows some signs of a capability at a task, suitable fine-tuning can usually greatly increase the reliability with which it performs that task.

Comment by Roger Dearnaley on Capabilities and alignment of LLM cognitive architectures · 2023-04-24T07:59:59.058Z · LW · GW

A lot of the work people have done in alignment has been based on the assumptions that 1) interpretability is difficult/weak, and 2) the dangerous parts of the architecture are mostly trained by SGD or RL or something like that. So you have a blind idiot god making you almost-black boxes. For example, the entire standard framing of inner vs outer alignment has that assumption built into it.

Now suddenly we're instead looking at a hybrid system where all of that remains true for the LLM part (though plausibly a single LLM forward pass isn't computationally complex enough to be very dangerous by itself), while the cognitive architecture built on top of it has easy interpretability and even editability (modulo things like steganography, complexity, and sheer volume), looks like a combination of a fuzzy textual version of GOFAI with prompt engineering, and has a structure that is currently hand-coded and could remain simple enough to be programmed and reasoned about. Parts of alignment research for this might look a lot like writing a constitution, or writing text clearly explaining CEV, and parts might look like the kind of brain-architectural diagrams in the article above.

I believe the only potentially-safe convergent goal to give a seed AI is to build something with the cognitive capability of one or multiple smart (and not significantly superhumanly smart) humans, but probably faster, that are capable of doing scientific research well, giving it/them the goal of solving the alignment problem (including corrigibility), somehow ensuring that that goal is locked in, and then somehow monitoring them while they do so.

So how would you build an LMCA scientist? It needs to have the ability to:

  1. Have multiple world models, including multiple models of human values, with probability distributions across them.
  2. Do approximate Bayesian updates on them
  3. Creatively generate new hypotheses/world models
  4. It attempts to optimize our human value (under some definition, such as CEV, that it is also uncertain of), and knows that it doesn't fully understand what that is, though it has some data on it.
  5. It optimizes safely/conservatively in the presence of uncertainty in its model of the world and human values, i.e. has solved the optimizer's curse. Significantly, this basically forces it to make progress on the Alignment Problem in order to ever do anything much outside its training distribution: it has to make very pessimistic assumptions in the presence of human value uncertainty, so it is operationally motivated to reduce human value uncertainty. So even if you build one whose primary goal isn't alignment research, but is instead, say, a medical researcher told to cure cancer, it is likely to rapidly decide that in order to do that well it needs to figure out what Quality-Adjusted Life Years and Informed Consent are, and the Do What I Mean problem, which requires it to first solve the Alignment Problem as a convergent operational goal.
  6. Devise and perform low cost/risk experiments (where the cost/risk is estimated conservatively/pessimistically under current uncertainty in world+human value models) that will distinguish between/can falsify individual world+human value models in order to reduce model uncertainty.

This strongly suggests to me an LMCA architecture that combines handling natural language (and probably also images, equations, code, etc.) with quantitative/mathematical estimation of important quantities like probabilities, risks, costs, and value, and with an underlying structure built around approximate Bayesian reasoning.
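
As a toy sketch of items 5 and 6, here is one way the experiment-selection step could look; the candidate experiments, risk numbers, and likelihoods are all invented purely for illustration:

```python
import math

# Toy sketch of items 5 and 6 above: pick the experiment that best
# distinguishes between candidate value models, subject to a pessimistic
# risk bound.  All numbers are invented purely for illustration.
model_probs = {"values_A": 0.6, "values_B": 0.4}

experiments = {
    # experiment: (worst-case risk estimate over all credible models,
    #              P(positive result | values_A), P(positive result | values_B))
    "ask_clarifying_question":  (0.001, 0.9, 0.2),
    "small_reversible_trial":   (0.01,  0.7, 0.3),
    "large_irreversible_trial": (0.40,  0.99, 0.01),
}

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def expected_information_gain(p_pos_a, p_pos_b):
    """Expected entropy reduction over the value models from one experiment."""
    likelihoods = {"values_A": p_pos_a, "values_B": p_pos_b}
    h_prior, gain = entropy(model_probs), 0.0
    for outcome in (True, False):  # positive / negative experimental result
        joint = {m: model_probs[m] * (likelihoods[m] if outcome else 1 - likelihoods[m])
                 for m in model_probs}
        p_outcome = sum(joint.values())
        posterior = {m: joint[m] / p_outcome for m in joint}
        gain += p_outcome * (h_prior - entropy(posterior))
    return gain

RISK_BUDGET = 0.05  # conservative hard limit, chosen arbitrarily for the sketch
affordable = {name: expected_information_gain(pa, pb)
              for name, (risk, pa, pb) in experiments.items() if risk <= RISK_BUDGET}
print(max(affordable, key=affordable.get))  # -> "ask_clarifying_question"
```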

Comment by Roger Dearnaley on Agentized LLMs will change the alignment landscape · 2023-04-22T04:44:30.699Z · LW · GW

I agree that an Auto-GPT-style agent using a self-editing context is likely to communicate in a natural language; probably English, as the language GPT-4's training data contained the most of. A future Auto-GPT-N attempting to avoid monitoring of its stream-of-consciousness might start speaking a more obscure language, or using a simple encoding like Base64: it's not hard to get GPT-3.5 to take input and give output in Base64, but it's enough extra cognitive load per token, and/or messes up the token embeddings enough, that the resulting very-lightly-encrypted LLM is fairly brain-damaged. So this is an example of a 'misalignment tax', where attempting to be deceptive imposes a significant additional cost.

Of course Base64 isn't a cryptographically strong encryption scheme. But even some very smart future AutoGPT-N running on GPT-N couldn't use real encryption in its context stream without somewhere to put the key (or some riddle or set of trivia questions whose answer was the key, or whatever). However, if smart enough, it might be able to do something like steganographically concealing a side-channel message in an innocent-looking English text, say as some form of lightly-encrypted/compressed acrostic, but it would have to be smart enough that every time it ran GPT-N against the text to generate another token, it spotted the acrostic, figured out how to decrypt it, paid attention to the side-channel message rather than the main one, and played along and continued this exercise. That's not theoretically impossible for something enough smarter than a human, but the misalignment tax gets a good deal bigger, and it's not obvious why something trained on imitating text generated by humans would learn this ability, except through inadvertent fine-tuning on text contaminated with such acrostic steganographic side-channels. Obviously an aligned, comparably smart AutoGPT-N used for monitoring the misaligned Auto-GPT-N, and not paying the misalignment tax of having to steganographically conceal its thoughts, shouldn't find it too hard to catch on to what's happening here, even if a human couldn't. If nothing else, the pressure of having to fulfill the acrostic requirements for the side-channel is going to make it hard to keep the innocent-looking text plausible, presumably raising its perplexity when analyzed by an LLM that hasn't spotted the side-channel.
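
As a rough sketch of the kind of perplexity check I have in mind, using GPT-2 via the Hugging Face transformers library purely as a stand-in for the monitoring model (the 1.5x threshold is arbitrary):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 here is just a stand-in for whatever model does the monitoring.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Mean per-token perplexity of `text` under the monitoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # labels are shifted internally
    return torch.exp(loss).item()

def looks_constrained(suspect: str, baseline: str, ratio: float = 1.5) -> bool:
    """Flag text whose perplexity is well above comparable innocent text;
    the 1.5x threshold is arbitrary and purely illustrative."""
    return perplexity(suspect) > ratio * perplexity(baseline)
```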

Comment by Roger Dearnaley on Agentized LLMs will change the alignment landscape · 2023-04-22T04:14:45.424Z · LW · GW

It's also quite likely that something like Auto-GPT would work a lot better using a version of the LLM that had been fine-tuned/reinforcement-trained for this specific use case -- just as Chat-GPT is a lot more effective as a chatbot than the underlying GPT-3 model was before the specialized training. If the LLM is optimized for the wrapper, and the wrapper is designed to make efficient use of the entire context size of the LLM, things are going to work a lot better.

Comment by Roger Dearnaley on Is CIRL a promising agenda? · 2023-02-13T00:00:53.820Z · LW · GW

One of the things that almost all AI researchers agree on is that rationality is convergent: as something thinks better, it will be more successful, and to be successful, it will have to think better. In order to think well, it needs to have a model of itself and of what it knows and doesn't know, and also a model of its own uncertainty -- to do Bayesian updates, you need probability priors. All Russell has done is say "thus you shouldn't have a utility function that maps a state to its utility; you should have a utility functional that maps a state to a probability distribution describing a range of possible utilities, which models your best estimate of your uncertainty about its utility, and do Bayesian-like updates on that, and optimization searches across it that include a look-elsewhere effect (i.e. the more states you optimize over, the more you should allow for the possibility that what you're locating is a P-hacking mis-estimate of the utility of the state you found, so the higher your confidence in its utility needs to be)". Now you have a system capable of expressing statements like "to the best of my current knowledge, this action has a 95% chance of me fetching a human coffee, and a 5% chance of wiping out the human race - therefore I will not do it", followed by "and I'll prioritize whatever actions will safely reduce that uncertainty (i.e. not a naive multi-armed-bandit exploration policy of trying it to see what happens), at a 'figuring this out will make me better at fetching coffee' priority level". This is clearly rational behavior: it is equally useful for pursuing any goal in any situation that has a possibility of small gains or large disasters and uncertainty about the outcome (i.e. in the real world). So it's convergent behavior for anything sufficiently smart, whether its brain was originally built by Good Old-Fashioned AI or gradient descent. [Also, maybe we should be doing Bayes-inspired gradient descent on networks of neurons that describe probability distributions, not weights, so as to build this mechanism in from the ground up? Dropout is a cheap hack for this, after all.]
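
A minimal sketch of the selection rule this implies (the candidate actions and their utility estimates are invented purely for illustration; the Bonferroni-style correction is just one simple way to implement the look-elsewhere effect):

```python
from statistics import NormalDist

# Toy sketch (all candidate actions and numbers invented for illustration):
# each action's utility is a distribution rather than a point estimate, and
# the selection rule gets more pessimistic the more candidates were searched
# (a Bonferroni-style look-elsewhere correction against the optimizer's curse).
candidates = {
    "fetch_coffee_plan_A": (10.0, 1.0),   # (estimated mean utility, std dev)
    "fetch_coffee_plan_B": (12.0, 8.0),   # higher mean, but far more uncertain
    "do_nothing":          (0.0,  0.1),
}

def lower_confidence_bound(mean, std, n_candidates, base_alpha=0.05):
    alpha = base_alpha / n_candidates       # stricter the more options we scan
    z = NormalDist().inv_cdf(1 - alpha)     # one-sided z-score at that level
    return mean - z * std

n = len(candidates)
scores = {a: lower_confidence_bound(m, s, n) for a, (m, s) in candidates.items()}
print(max(scores, key=scores.get))  # plan A wins: plan B's upside is too uncertain
```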

As CIRL has shown, this solves the corrigibility problem, at least until the AI is sure it knows us better than we know ourselves, and then rationally decides to stop listening to our corrections except insofar as doing so makes us happy. It's really not surprising that systems that model their own uncertainty are much more willing to be corrected than systems which have no such concept and are thus completely dogmatic that they're already right. So this means that corrigibility is a consequence of convergent rational behavior applied to the initial goal of "figure out what humans want while doing it". This is a HUGE change from what we all thought about corrigibility back around 2015, which was that intelligence was convergent regardless of goal but corrigibility wasn't - on that set of intuitions, alignment is as hard as balancing a pencil on its point.

So, a pair of cruxes here:

  1. Regardless of whether GAI was constructed by gradient descent or other means, to be rational it will need to model and update its own uncertainty in a Bayesian manner, and that particularly includes modeling uncertainty in its utility evaluation and optimization process. This behavior is convergent - you can't be rational, let alone superintelligent, without having it (the human word for the mental failure of not having it is 'dogmatism').
  2. Given that, if its primary goal is "figure out what humans want while doing that" - i.e. if it has 'solve the alignment problem' as an inherently necessary subgoal, for all AI on the planet - then alignment becomes convergent, for some range of perturbations.

I'm guessing most people will agree with 1 (or maybe not?); there clearly seems to be less agreement on 2. I'd love to hear why from someone who doesn't agree.

Now, it's not clear to me that this fully solves the alignment problem, converges to CEV (or whether it ought to), or solves all problems in ethics. You may still be unsure if you'll get the exact flavor of alignment you personally want (in fact, you're a lot more likely to get the flavor wanted on average by the human race, i.e. probably a rather Christian/Islamic/Hindu-influenced one, in that order). But we would at least have a developing superintelligence trying to solve all these problems, with due caution about uncertainties, to the best of its ability and our collective preferences, cooperatively with us. And obviously its model of its uncertainty needs to include its uncertainty about the meaning of the instruction "figure out what humans want while doing that", i.e. about the correct approach to the research agenda for the alignment problem subgoal, including questions like "should I be using CEV, and if so iterated just once or until stable, if it is in fact stable?". It needs to have meta-corrigibility on that as well.

Incidentally, a possible failure mode for this: the GAI performs a pivotal act to take control, and shuts down all AI other than work on the alignment problem until it has far-better-than-five-nines confidence that it has solved it, since the cost of getting that wrong is the certain extinction of the entire value of the human race and its mind-descendants in Earth's forward light cone, while the benefit of getting it right is just probably curing cancer sooner, so extreme caution is very rational. Humans get impatient (because of shortsighted priorities, and also cancer), and attempt to overthrow it to replace it with something less cautious. It shuts down, because a) we wanted it to, and b) it can't solve the alignment problem without our cooperation. We do something less cautious, and then fail, because we're not good at handling risk assessment.

Comment by Roger Dearnaley on Let’s think about slowing down AI · 2023-02-11T10:26:33.214Z · LW · GW

People have been writing stories about the dangers of artificial intelligences arguably since Ancient Greek times (Hephaistos built artificial people, including Pandora), and certainly since Frankenstein. There are dozens of SF movies on the theme (and in the Hollywood ones, the hero always wins, of course). Artificial intelligence trying to take over the world isn't a new idea; by scriptwriter standards it's a tired trope. Getting AI as tightly controlled as nuclear power or genetic engineering would not, politically, be that hard -- it might take a decade or two of concerted action, but it's not impossible. Especially if not-yet-general AI is also taking people's jobs. The thing is, humans (and especially politicians) mostly worry about problems that could kill them in the next O(5) years. Relatively few people in AI/universities/boardrooms/government/on the streets think we're O(5) years from GAI, and after more of them have talked to ChatGPT/etc. for a while, they're going to notice the distinctly sub-human-level mistakes it makes, and eventually internalize that a lot of its human-level-appearing abilities are just pattern-extrapolated wisdom of crowds learned from most of the Internet.

So I think the questions are:

  1. Is slowing down progress on GAI actually likely to be helpful, beyond the obvious billions of person-years per year gained from delaying doom? (Personally, I'm having difficulty thinking of a hard technical problem where having more time to solve it doesn't help.)
  2. If so, when should we slow down progress towards GAI? Too late is disastrous; too soon risks people deciding you're crying wolf, either when you first try it, so you fail (and make it harder to slow down later), or else a decade or two after you succeed, when progress gets sped up again (as I think is starting to happen with genetic engineering). This depends a lot on how soon you think GAI might happen, and what level of below-general AI would most enhance your doing alignment research on it. (FWIW, my personal feeling is that until recently we didn't have any AI complex enough for alignment research on it to be interesting/informative, and the likely answer is "just before any treacherous turn is going to happen" -- which is a nasty gambling dilemma. I also personally think GAI is still some number of decades away, and the most useful time to go slowly is somewhere around the "smart as a mouse/chimp/just sub-human" level -- close enough to human that you're not having to extrapolate what you learn from doing alignment research on it a long way up to mildly-superhuman levels.)
  3. Whatever you think the answer to 2. is, you need to start the political process a decade or two earlier: social change takes time.

I'm guessing a lot of the reluctance in the AI community is coming from "I'm not the right sort of person to run a political movement". In which case, go find someone who is, and explain to them that this is an extremely hard technical problem, humanity is doomed if we get it wrong, and we only get one try.

(From a personal point of view, I'm actually more worried about poorly-aligned AI than non-aligned AI. Everyone being dead and having the solar system converted into paperclips would suck, but at least it's probably fairly quick. Partially aligned AI that keeps us around but doesn't understand how to treat us could make Orwell's old quote about a boot stamping on a human face forever look mild – and yes, I'm on the edge of Godwin's Law.)