Yes. Current AI policy is like people in a crowded room fighting over who gets to hold a bomb. It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.
That said, we're currently not near any satisfactory solutions to corrigibility. And I do think it would be better for the world if it were easier (by some combination of technical factors and societal factors) to build AI that works for the good of all humanity than to build equally-smart AI that follows the orders of a single person. So yes, we should focus research and policy effort toward making that happen, if we can.
And if we were in that world already, then I agree releasing all the technical details of an AI that follows the orders of a single person would be bad.
One way of phrasing the AI alignment task is to get AIs to “love humanity” or to have human welfare as their primary objective (sometimes called “value alignment”). One could hope to encode these via simple principles like Asimov’s three laws or Stuart Russell’s three principles, with all other rules derived from these.
I certainly agree that Asimov's three laws are not a good foundation for morality! Nor are any other simple set of rules.
So if that's how you mean "value alignment," yes let's discount it. But let me sell you on a different idea you haven't mentioned, which we might call "value learning."[1]
Doing the right thing is complicated.[2] Compare this to another complicated problem: telling photos of cats from photos of dogs. You cannot write down a simple set of rules to tell apart photos of cats and dogs. But even though we can't solve the problem with simple rules, we can still get a computer to do it. We show the computer a bunch of data about the environment and human classifications thereof, have it tweak a bunch of parameters to make a model of the data, and hey presto, it tells cats from dogs.
Learning the right thing to do is just like that, except for all the ways it's different that are still open problems:
- Humans are inconsistent and disagree with each other about the right thing more than they are inconsistent/disagree about dogs and cats.
- If you optimize for doing the right thing, this is a bit like searching for adversarial examples, a stress test that the dog/cat classifier didn't have to handle.
- When building an AI that learns the right thing to do, you care a lot more about trust than when you build a dog/cat classifier.
This margin is too small to contain my thoughts on all these.
There's no bright line between value learning and techniques you'd today lump under "reasonable compliance." Yes, the user experience is very different between (e.g.) an AI agent that's operating a computer for days or weeks vs. a chatbot that responds to you within seconds. But the basic principles are the same - in training a chatbot to behave well you use data to learn some model of what humans want from a chatbot, and then the AI is trained to perform well according to the modeled human preferences.
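To make "use data to learn a model of what humans want, then train against it" concrete, here's a minimal sketch of the standard pairwise-preference setup (names, shapes, and numbers all made up, not any particular lab's pipeline):

```python
import torch
import torch.nn as nn

# Hypothetical reward model: maps features of a response to a scalar
# "how much would humans like this" score.
class RewardModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, response_features):
        return self.net(response_features).squeeze(-1)

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(chosen_feats, rejected_feats):
    # Bradley-Terry style loss: the response humans preferred should score higher.
    margin = reward_model(chosen_feats) - reward_model(rejected_feats)
    return -torch.nn.functional.logsigmoid(margin).mean()

# One training step on a batch of human comparisons (placeholder features).
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)
loss = preference_loss(chosen, rejected)
loss.backward()
opt.step()
# The chatbot (or agent) is then trained to score well under reward_model,
# i.e. under the modeled human preferences.
```

The chatbot case and the long-horizon agent case differ in what goes into the features and how long the feedback loop is, not in this basic structure.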
The open problems for general value learning are also open problems for training chatbots to be reasonable. How do you handle human inconsistency and disagreement? How do you build trust that the end product is actually reasonable, when that's so hard to define? Etc. But the problems have less "bite," because less can go wrong when your AI is briefly responding to a human query than when your AI is using a computer and navigating complicated real-world problems on its own.
You might hope we can just say value learning is hard, and not needed anyhow because chatbots need it less than agents do, so we don't have to worry about it. But the chatbot paradigm is only a few years old, and there is no particular reason it should be eternal. There are powerful economic (and military) pressures towards building agents that can act rapidly and remain on-task over long time scales. AI safety research needs to anticipate future problems and start work on them ahead of time, which means we need to be prepared for instilling some quite ambitious "reasonableness" into AI agents.
[1] For a decent introduction from 2018, see this collection.
[2] Citation needed.
Yeah, that's true. I expect there to be a knowing/wanting split - AI might be able to make many predictions about how a candidate action will affect many slightly-conflicting notions of "alignment", or make other long-term predictions, but that doesn't mean it's using those predictions to pick actions. Many people want to build AI that picks actions based on short-term considerations related to the task assigned to it.
I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another.
Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it's acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants are quite possible.
In the good case with the AI modeling the whole problem, this might look like us starting out with enough of a solution to alignment that the vibe is less "we need to hurry and use the AI to do our work for us" and more "we're executing a shared human-AI gameplan for learning human values that are good by human standards."
In the bad case with the AI acting through feedback loops with humans, this might look like the AI never internally representing deceiving us, humans just keep using it in slightly wrong ways that end up making the future bad. (Perhaps by giving control to fallible authority figures, perhaps by presenting humans with superstimuli that cause value drift we think is bad from our standpoint outside the thought experiment, perhaps by defining "what humans want" in a way that captures many of the 'advantages' of deception for maximizing reward without triggering our interpretability tools that are looking for deception.)
I think particularly when the AI is acting in feedback loops with humans, we could get bounced between categories by things like human defectors trying to seize control of transformative AI, human society cooperating and empowering people who aren't defectors, new discoveries made by humans about AI capabilities or alignment, economic shocks, international diplomacy, and maybe even individual coding decisions.
First, I agree with Dmitry.
But it does seem like maybe you could recover a notion of information bottleneck even without the Bayesian NN model. If you quantize real numbers to N-bit floating point numbers, there's a very real quantity which is "how many more bits do you need to exactly reconstruct X, given Z?" My suspicion is that for a fixed network, this quantity grows linearly with N (and if it's zero at 'actual infinity' for some network despite being nonzero in the limit, maybe we should ignore actual infinity).
But this isn't all that useful; it would be nicer to have an information measure that converges. The divergence also seems a bit silly, because it treats the millionth digit as just as important as the first.
So suppose you don't want to perfectly reconstruct X. Instead, maybe you could say the distribution of X is made of some fixed number of bins or summands, and you want to figure out which one based on Z. Then you get a converging amount of information, and you correctly treat small numbers as less important, but you've had to introduce this somewhat arbitrary set of bins. shrug
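A rough sketch of what I mean by the binned version (a plug-in estimate from samples; the number of bins is exactly the arbitrary choice I'm shrugging about):

```python
import numpy as np

def binned_mutual_information(x, z, n_bins=16):
    """Plug-in estimate of I(bin(X); bin(Z)) in bits, using quantile bins."""
    x_bins = np.digitize(x, np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    z_bins = np.digitize(z, np.quantile(z, np.linspace(0, 1, n_bins + 1)[1:-1]))
    joint, _, _ = np.histogram2d(x_bins, z_bins, bins=n_bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    pz = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ pz)[nz])))

# Toy example: Z is a noisy view of X. Adding bins recovers more bits,
# but only up to the noise floor, so the estimate converges instead of diverging.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
z = x + 0.5 * rng.normal(size=100_000)
print(binned_mutual_information(x, z, n_bins=16))
```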
A process or machine prepares either |0> or |1> at random, each with 50% probability. Another machine prepares either |+> or |-> based on a coin flip, where |+> = (|0> + |1>)/√2 and |-> = (|0> - |1>)/√2. In your ontology these are actually different machines that produce different states.
I wonder if this can be resolved by treating the randomness of the machines quantum mechanically, rather than having this semi-classical picture where you start with some randomness handed down from God. Suppose these machines use quantum mechanics to do the randomization in the simplest possible way - they have a hidden particle in state |left>+|right> (pretend I normalize), they mechanically measure it (which from the outside will look like getting entangled with it) and if it's on the left they emit their first option (|0> or |+> depending on the machine) and vice versa.
So one system, seen from the outside, goes into the state |L,0>+|R,1>, the other one into the state |L,0>+|R,0>+|L,1>-|R,1>. These have different density matrices. The way you get down to identical density matrices is to say you can't get the hidden information (it's been shot into outer space or something). And then when you assume that and trace out the hidden particle, you get the same representation no matter your philosophical opinion on whether to think of the un-traced state as a bare state or as a density matrix. If on the other hand you had some chance of eventually finding the hidden particle, you'd apply common sense and keep the states or density matrices different.
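For concreteness, a small numerical check that the two machines give identical density matrices once you trace out the hidden particle (toy numpy, handling the normalization I was ignoring):

```python
import numpy as np

ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
plus, minus = (ket0 + ket1) / np.sqrt(2), (ket0 - ket1) / np.sqrt(2)
L, R = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # hidden particle states

# Machine A: |L,0> + |R,1>;  Machine B: |L,+> + |R,->  (normalized)
psi_A = (np.kron(L, ket0) + np.kron(R, ket1)) / np.sqrt(2)
psi_B = (np.kron(L, plus) + np.kron(R, minus)) / np.sqrt(2)

def trace_out_hidden(psi):
    """Reduced density matrix of the emitted qubit, tracing out the hidden particle."""
    full = np.outer(psi, psi.conj()).reshape(2, 2, 2, 2)  # (hidden, qubit, hidden', qubit')
    return np.einsum('iaib->ab', full)

print(trace_out_hidden(psi_A))  # I/2
print(trace_out_hidden(psi_B))  # I/2 -- identical, as long as the hidden particle stays hidden
```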
Anyhow, yeah, broadly agree. Like I said, there's a practical use for saying what's "real" when you want to predict future physics. But you don't always have to be doing that.
people who study very "fundamental" quantum phenomena increasingly use a picture with a thermal bath
Maybe talking about the construction of pointer states? That linked paper does it just as you might prefer, putting the Boltzmann distribution into a density matrix. But of course you could rephrase it as a probability distribution over states and the math goes through the same, you've just shifted the vibe from "the Boltzmann distribution is in the territory" to "the Boltzmann distribution is in the map."
Still, as soon as you introduce the notion of measurement, you cannot get away from thermodynamics. Measurement is an inherently information-destroying operation, and iiuc can only be put "into theory" (rather than being an arbitrary add-on that professors tell you about) using the thermodynamic picture with nonunitary operators on density matrices.
Sure, at some level of description it's useful to say that measurement is irreversible, just like at some level of description it's useful to say entropy always increases. Just like with entropy, it can be derived from boundary conditions + reversible dynamics + coarse-graining. Treating measurements as reversible probably has more applications than treating entropy as reversible, somewhere in quantum optics / quantum computing.
Some combination of:
- Interpretability
- Just check if the AI is planning to do bad stuff, by learning how to inspect its internal representations.
- Regularization
- Evolution got humans who like Doritos more than health food, but evolution didn't have gradient descent. Use regularization during training to penalize hidden reasoning.
- Shard / developmental prediction
- Model-free RL will predictably use simple heuristics for the reward signal. If we can predict and maybe control how this happens, this gives us at least a tamer version of inner misalignment.
- Self-modeling
- Make it so that the AI has an accurate model of whether it's going to do bad stuff. Then use this to get the AI not to do it.
- Control
- If inner misalignment is a problem when you use AIs off-distribution and give them unchecked power, then don't do that.
Personally, I think the most impactful will be Regularization, then Interpretability.
The real chad move is to put "TL;DR: See above^" for every section.
When you say there's "no such thing as a state," or "we live in a density matrix," these are statements about ontology: what exists, what's real, etc.
Density matrices use the extra representational power they have over states to encode a probability distribution over states. If we regard the probabilistic nature of measurements as something to be explained, putting the probability distribution directly into the thing we live in is what I mean by "explain with ontology."
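Spelling out what I mean by that extra representational power (just the textbook packaging, nothing new):

```latex
\rho \;=\; \sum_i p_i \,|\psi_i\rangle\langle\psi_i| ,
\qquad \sum_i p_i = 1 ,
\qquad \Pr(\text{outcome } m) \;=\; \mathrm{Tr}\!\left(\Pi_m\, \rho\right).
```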
Epistemology is about how we know stuff. If we start with a world that does not inherently have a probability distribution attached to it, but obtain a probability distribution from arguments about how we know stuff, that's "explain with epistemology."
In quantum mechanics, this would look like talking about anthropics, or what properties we want a measure to satisfy, or solomonoff induction and coding theory.
What good is it to say things are real or not? One useful application is predicting the character of physical law. If something is real, then we might expect it to interact with other things. I do not expect the probability distribution of a mixed state to interact with other things.
Treating the density matrix as fundamental is bad because you shouldn't explain with ontology that which you can explain with epistemology.
Be sad.
For topological debate that's about two agents picking settings for simulation/computation, where those settings have a partial order that lets you take the "strictest" combination, a big class of fatal flaw would be if you don't actually have the partial order you think you have within the practical range of the settings - i.e. if some settings you thought were more accurate/strict are actually systematically less accurate.
In the 1D plane example, this would be if some specific length scales (e.g. exact powers of 1000) cause simulation error, but as long as they're rare, this is pretty easy to defend against.
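A toy picture of the partial order I have in mind for the 1D plane example (each player bids a resolution for each region, and the simulator runs each region at the pointwise strictest bid):

```python
import numpy as np

# Hypothetical bids: resolution (cells per unit length) for each of four regions.
true_player_bid = np.array([4, 16, 4, 8])
false_player_bid = np.array([8, 2, 32, 8])

# "Strictest combination" = the join in this partial order: finest resolution per region.
strictest = np.maximum(true_player_bid, false_player_bid)
print(strictest)  # [ 8 16 32  8]
```

The fatal-flaw case above is when "finer" doesn't actually mean "more accurate" somewhere in this range.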
In the fine-grained plane example, though, there's a lot more room for fine-grained patterns in which parts of the plane get modeled at which length scale to start having nonlinear effects. If the agents are not allowed to bid "maximum resolution across the entire plane," and instead are forced to allocate resources cleverly, then maybe you have a problem. But hopefully the truth is still advantaged, because the false player has to rely on fairly specific correlations, and the true player can maybe bid a bunch of noise that disrupts almost all of them.
(This makes possible a somewhat funny scene, where the operator expected the true player's bid to look "normal," and then goes to check the bids and both look like alien noise patterns.)
An egregious case would be where it's harder to disrupt patterns injected during bids - e.g. if the players' bids are 'sparse' / have finite support and might not overlap. Then the notion of the true player just needing to disrupt the false player seems a lot more unlikely, and both players might get pushed into playing very similar strategies that take every advantage of the dynamics of the simulator in order to control the answer in an unintended way.
I guess for a lot of "tough real world questions," the difficulty of making a super-accurate simulator (one you even hope converges to the right answer) torpedoes the attempt before we have to start worrying about this kind of 'fatal flaw'. But anything involving biology, human judgment, or too much computer code seems tough. "Does this gene therapy work?" might be something you could at least imagine a simulator for that still seems like it gives lots of opportunity for the false player.
Fun post, even though I don't expect debate of either form to see much use (because resolving tough real world questions offers too many chances for the equivalent of the plane simulation to have fatal flaws).
With bioweapons evals at least the profit motive of AI companies is aligned with the common interest here; a big benefit of your work comes from when companies use it to improve their product. I'm not at all confused about why people would think this is useful safety work, even if I haven't personally hashed out the cost/benefit to any degree of confidence.
I'm mostly confused about ML / SWE / research benchmarks.
The mathematical structure in common is called a "measure."
I agree that there's something mysterious-feeling about probability in QM, though I mostly think that feeling is an illusion. There's a (among physicists) famous fact that the only way to put a 'measure' on a wavefunction that has nice properties (e.g. conservation over time) is to take the amplitude squared. So there's an argument: probability is a measure, and the only measure that makes sense is the amplitude-squared measure, therefore if probability is anything it's the amplitude squared. And it is! Feels mysterious.
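The conservation half of that argument fits in one line (for unitary time evolution U acting on the state):

```latex
\sum_i |a_i|^2 \;=\; \langle\psi|\psi\rangle
\;\longmapsto\;
\langle\psi|\,U^\dagger U\,|\psi\rangle \;=\; \langle\psi|\psi\rangle ,
```

so the amplitude-squared measure is the one that unitary evolution conserves.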
But after getting more used to anthropics and information theory, you start to accumulate more arguments for the same thing that take it from a different angle, and it stops feeling so mysterious.
Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me?
It's not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I'm not clear.
But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve.
But clearly other people think differently than me.
One big reason I might expect an AI to do a bad job at alignment research is if it doesn't do a good job (according to humans) of resolving cases where humans are inconsistent or disagree. How do you detect this in string theory research? Part of the reason we know so much about physics is humans aren't that inconsistent about it and don't disagree that much. And if you go to sub-topics where humans do disagree, how do you judge its performance (because 'be very convincing to your operators' is an objective with a different kind of danger).
Another potential red flag is if the AI gives humans what they ask for even when that's 'dumb' according to some sophisticated understanding of human values. This could definitely show up in string theory research (note when some ideas suggest non-string-theory paradigms might be better, and push back on the humans if the humans try to ignore this), it's just intellectually difficult (maybe easier in loop quantum gravity research heyo gottem) and not as salient without the context of alignment and human values.
Thanks for the great reply :) I think we do disagree after all.
humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans
Except about that - here we agree.
Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflective thinking rather than snap judgments or falling back on heuristics. It could also involve AI assistance to help counter human biases, find common ground, and consider the logical consequences of communicated values.
This might be summarized as "If humans are inaccurate, let's strive to make them more accurate."
I think this, as a research priority or plan A, is doomed by a confluence of practical facts (humans aren't actually that consistent, even in what we'd consider a neutral setting) and philosophical problems (What if I think the snap judgments and heuristics are important parts of being human? And, how do you square a univariate notion of 'accuracy' with the sensitivity of human conclusions to semi-arbitrary changes to e.g. their reading lists, or the framings of arguments presented to them?).
Instead, I think our strategy should be "If humans are inconsistent and disagree, let's strive to learn a notion of human values that's robust to our inconsistency and disagreement."
We contend that even as AI gets really smart, humans ultimately need to be in the loop to determine whether or not a constitution is aligned and reasonable.
A committee of humans reviewing an AI's proposal is, ultimately, a physical system that can be predicted. If you have an AI that's good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actual humans meeting in committee is unnecessary.
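A toy version of that point (everything here is hypothetical, and the predictor is deliberately perfect):

```python
def real_committee(proposal: str) -> float:
    """Stand-in for the slow physical process of humans reviewing a proposal."""
    return 1.0 if "transparent" in proposal else -1.0

def modeled_committee(proposal: str) -> float:
    """The AI's learned prediction of real_committee -- here, a perfect one."""
    return 1.0 if "transparent" in proposal else -1.0

proposals = ["transparent plan A", "opaque plan B"]
# The AI scores proposals against its model of the committee before acting.
best = max(proposals, key=modeled_committee)
assert modeled_committee(best) == real_committee(best)  # the actual meeting adds nothing
print(best)
```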
(And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI's decision-making.)
I basically think your sixth to last (or so) bullet point is key - an AI that takes over is likely to be using a lot more RL on real world problems, i.e. drawn from a different distribution than present-day AI. This will be worse for us than conditioning on a present-day AI taking over.
Cool stuff!
I'm a little confused what it means to mean-ablate each node...
Oh, wait. ctrl-f shows me the Non-Templatic data appendix. I see, so you're tracking the average of each feature, at each point in the template. So you can learn a different mask at each token in the template and also learn a different mean (and hopefully your data distribution is balanced / high-entropy). I'm curious - what happens to your performance with zero-ablation (or global mean ablation, maybe)?
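For reference, my rough mental model of the two ablations, with made-up shapes (so you can tell me where I've misread the setup):

```python
import numpy as np

# Activations: (batch, template_position, feature). Each template position gets
# its own learned binary mask and its own per-position mean activation.
acts = np.random.randn(64, 10, 512)
mask = np.random.rand(10, 512) > 0.5          # True = keep this feature at this position
position_means = acts.mean(axis=0)            # (template_position, feature)

def mean_ablate(acts, mask, means):
    # Masked-out features are replaced by their per-position average.
    return np.where(mask, acts, means)

def zero_ablate(acts, mask):
    # Masked-out features are simply zeroed instead.
    return np.where(mask, acts, 0.0)

patched_mean = mean_ablate(acts, mask, position_means)
patched_zero = zero_ablate(acts, mask)
```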
Excited to see what you come up with for non-templatic tasks. Presumably on datasets of similar questions, similar attention-control patterns will be used, and maybe it would just work to (somehow) find which tokens are getting similar attention, and assign them the same mask.
It would also be interesting to see how this handles more MLP-heavy tasks like knowledge questions. Maybe someone clever can find a template for questions about the elements, or the bibliographies of various authors, etc.
No, you're right that aristocracy is more complicated. There were lots of pressures that shaped the form of it. Certainly more than how good of managers aristocrats made!
An invalid syllogism: "The rules of aristocracy were shaped by forces. Avoiding poor management is a force. Therefore, the rules of aristocracy will be all about avoiding poor management."
Aristocrats were also selected for how well they could extract rents from those below, and how well they could resist rent-extraction from above, both alone and collectively. Nor was the top-down pressure all about making aristocrats into productive managers - rent-extraction has been mentioned, and also weakening the aristocracy to secure central power, allowing advancement via marriage and alliance, various human status games, and the need for a legislative arm of government.
I don't want to hear the One Pressure That Explains Everything (but only qualitatively, and if you squint). I'll want to hear when they have the dozen pressures that make up a model that can be quantitatively fit to past data by tuning some parameters, including good retrodictive accuracy over a held-out time period.
I think if you want to go fast, and you can eat the rest of the solar system, you can probably make a huge swarm of fusion reactors to help blow matter off the sun. Let's say you can build 10^11-watt reactors that work in space. Then you need about 10^15 of them to match the sun. If each is 10^6 kg, this is roughly 10^-2 of Mercury's mass.
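The arithmetic, with round numbers (taking solar luminosity as about 4x10^26 W):

```python
solar_luminosity_w = 3.8e26
reactor_power_w = 1e11
reactor_mass_kg = 1e6
mercury_mass_kg = 3.3e23

n_reactors = solar_luminosity_w / reactor_power_w   # ~4e15 reactors to match the sun
swarm_mass_kg = n_reactors * reactor_mass_kg        # ~4e21 kg
print(n_reactors, swarm_mass_kg / mercury_mass_kg)  # mass ratio ~1e-2 of Mercury
```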
I was expecting (Methods start 16:00)
When you find a fence in a field, someone once built that fence on purpose and had a reason for it. So it's good sense to ask after that reason, and guess ahead of time that it might be worth a fence, to the owner of the field.
When you find a rock in a field, probably nobody put that rock there on purpose. And so it's silly to go "What is the reason this rock was put here? I might not know now, but I can guess ahead of time it might be worth it to me!"
I agree it's a good point that you don't need the complexity of the whole world to test ideas. With an environment that's fairly small in terms of number of states, you can encode interesting things in a long sequence of states so long as the generating process is sufficiently interesting. And adding more states is itself no virtue if it doesn't help you understand what you're trying to test for.
Some out-of-order thoughts:
- Testing for 'big' values, e.g. achievement, might require complex environments and evaluations. Not necessarily large state spaces, but the complexity of differentiating between subtle shades of value (which seems like a useful capability to be sure we're getting) has to go somewhere.
- Using more complicated environments that are human-legible might better leverage human feedback and/or make sense to human observers - maybe you could encode achievement in the actions of a square in a gridworld, but maybe humans would end up making a lot of mistakes when trying to judge the outcome. If you want to gather data from humans, to reflect a way that humans are complicated that you want to see if an AI can learn, a rich environment seems useful. On the other hand, if you just want to test general learning power, you could have a square in a gridworld have random complex decision procedures and see if they can be learned.
There's a divide between contexts where humans are basically right, and so we just want an AI to do a good job of learning what we're already doing, and contexts where humans are inconsistent, or disagree with each other, where we want an AI to carefully resolve these inconsistencies/disagreements in a way that humans endorse (except also sometimes we're inconsistent or disagree about our standards for resolving inconsistencies and disagreements!).
Building small benchmarks for the first kind of problem seems kind of trivial in the fully-observed setting where the AI can't wirehead. Even if you try to emulate the partial observability of the real world, and include the AI being able to eventually control the reward signal as a natural part of the world, it seems like seizing control of the reward signal is the crux rather than the content of what values are being demonstrated inside the gridworld (I guess it's useful to check if the content matters, I just don't expect it to), and a useful benchmark might be focused on how seizing control of the reward signal (or not doing so) scales to the real world.
Building small benchmarks for the latter kind of problem seems important. The main difficulty is more philosophical than practical - we don't know what standard to hold the benchmarks to. But supposing we had some standard in mind, I would still worry that a small benchmark would be more easily gamed, and more likely to miss some of the ways humans are inconsistent or disagree. I would also expect benchmarks of this sort, whatever the size, to be a worse fit for normal RL algorithms, and run into issues where different learning algorithms might request different sorts of interaction with the environment (although this could be solved either by using real human feedback in a contrived situation, or by having simulated inhabitants of the environment who are very good at giving diverse feedback).
Honestly I think this is still too optimistic. Humans are not consistent economic actors - we can be persuaded of new things even if those things are subjective, will sometimes take deals we might in other circumstances call unfavorable, and on an absolute scale aren't all that bright. Owning capital does not fix this, and so an AI that's good at interacting with humans will be able to get more from us than you might expect just looking at the current economy.
As Sean Carroll likes to say, though, the reason we've made so much progress in physics is that it's way easier than the other sciences :)
Voluntary interaction has been great for humans. But it hasn't been great for orangutans, who don't do a very good job of participating in society.
Even if you somehow ensure transparency and cooperation among superintelligent AIs and humans, it seems overwhelmingly likely that humans will take the place of the orangutan, marginalized and taken from in every way possible within the limits of what is, in the end, not a very strict system. It is allowed, as Eliezer would say.
Orangutans don't contribute to human society even though they're specialized in things humans aren't. The best chess player in the world isn't a human-AI symbiote, for the same reason it's not an orangutan-human-AI symbiote.
Human trades with superintelligent AI do not have to be Pareto improvements (in the common-sense way), because humans make systematic mistakes (according to the common-sense standard). If you actually knew how to detect what trades would be good for humans - how to systematize that common sense, and necessarily also how to improve it since it is itself inconsistent and systematically mistaken - this would be solving the key parts of the value alignment problem that one might have hoped to sidestep by relying on voluntarism instead.
I'm not excited by gridworlds, because they tend to skip straight to representing the high-level objects we're supposed to value, without bothering to represent all the low-level structure that actually lets us learn and generalize values in the real world.
Do you have plans for how to deal with this, or plans to think about richer environments?
Because 'alignment' is used in several different ways, I feel like these days one either needs to asterisk in a definition (e.g. "By 'alignment,' I mean the AI faithfully carrying out instructions without killing everyone."), or just use a more specific phrase.
I agree that instruction-following is not all you need. Many of these problems are solved by better value-learning.
I strongly suspect that if you try to set the regularization without checking how well it does, you'll either get an unintelligent policy that's extraordinarily robust, or you'll get wireheading with error-correction (if wireheading was incentivized without the regularization).
https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth
I sometimes come back to think about this post. Might as well write a comment.
Goodhart's law. You echo the common frame that an approximate value function is almost never good enough, and that's why Goodhart's law is a problem. Probably what I thought when I first read this post was that I'd just written a sequence about how human values live inside models of humans (whether our own models or an AI's), which makes that frame weird - weird to talk about an 'approximate value function' that's not really an approximation to anything specific. The Siren Worlds problem actually points towards more meat - how do we want to model preferences for humans who are inconsistent, who have opinions about that inconsistency, who mistrust themselves though even that mistrust is imperfect?
You say basically all this at various points in the post, so I know it's kind of superficial to talk about the initial framing. But to indulge my superficiality for a while, I'm curious about how it's best to talk about these things (a) conveniently and yet (b) without treating human values as a unique target out there to hit.
In physics pedagogy there's kind of an analogous issue, where intro QM is designed to steer students away from thinking in terms of "wave-particle duality" - which many students have heard about and want to think in terms of - by just saturating them with a frame where you think in terms of wave functions that sometimes give probability distributions that get sampled from (by means left unspecified).
My inclination is to do the same thing to the notion "fixed, precise human values," which are a convenient way to think about everyday life and which many people want to think of value learning in terms of. I'd love to know a good frame to saturate the initial discussion of amplified human values, identifiability, etc. with that would introduce those topics as obviously a result of human values being very "fuzzy" and also humans having self-reflective opinions about how they want to be extrapolated.
Helpers / Helpees / Ghosts section. A good section :)
I don't think we have to go to any lengths to 'save' the ghosts example by supposing that a bunch of important values rest on the existence of ghosts. A trivial action (e.g. lighting incense for ghosts) works just as well, or maybe even no action, just a hope that the AI could do something for the ghosts.
It does seem obvious at first that if there are no ghosts, the AI should not light incense for them. But there's some inherent ambiguity between models of humans that light incense for the sake of the ghosts, and models of humans that light incense for the sake of cultural conformity, and models of humans that light incense because they like incense. Even if the written text proclaims that it's all for the ghosts, since there are no ghosts there must be other explanations for the behavior, and maybe some of those other explanations are at least a little value-shaped. I agree that what courses of action are good will end up depending on the details.
Maybe you get lured in by the "fixed, precise human values" frame here, when you talk about the AI knowing precisely how the human's values would update upon learning there are no ghosts. Precision is not the norm from which needing to do the value-amplification-like reasoning is a special departure, the value-amplification-like reasoning is the norm from which precision emerges in special cases.
Wireheading. I'm not sure time travel is actually a problem?
Or at least, I think there are different ways to think of model-based planning with modeled goals, and the one in which time travel isn't a problem seems like the more natural way.
The way to do model-based planning with modeled goals in which time travel is a problem is: you have spread-out-in-time model of the world that you can condition on your different actions, and first you condition it on that action "time travel to a century ago and change human values to be trivially satisfied" and then you evaluate how well the world is doing according to the modeled function "human values as of one second ago, conditional on the chosen action."
The way to do the planning in which time travel isn't a problem is: you have a model of the world that tracks current and past state, plus a dynamics model that you can use to evolve the state conditional on different actions. The human values you use to evaluate actions are part of the unconditioned present state, never subjected to the dynamics.
On the other hand, this second way does seem like it's making more, potentially unnecessary, commitments for the AI - if time travel is possible, what even is its dynamics model supposed to say is happening to the state of the universe? Humans have the exact same problem - we think weird thoughts like "after I time traveled, smallpox was eradicated sooner," which imply the silly notion that the time travel happened at some time in the evolution of the state of the universe. Or are those thought so silly after all? Maybe if time travel is possible in the way normally understood, we should be thinking of histories of computations rather than histories of universes, and the first sort of AI is actually making a mistake by erasing histories of computation.
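A toy contrast of the two planning setups (everything here is made up purely for illustration):

```python
def plan_with_values_inside_the_model(state, actions, dynamics, read_values):
    # Setup 1: values are read out of the *conditioned* state, so an action that
    # rewrites "human values as of one second ago" changes its own score.
    def score(action):
        future = dynamics(state, action)
        return read_values(future)(future)
    return max(actions, key=score)

def plan_with_values_outside_the_dynamics(state, actions, dynamics, read_values):
    # Setup 2: values are read from the unconditioned present state once and never
    # run through the dynamics, so editing the values buys the planner nothing.
    frozen_values = read_values(state)
    return max(actions, key=lambda a: frozen_values(dynamics(state, a)))

# Minimal toy world where one action hacks the values.
state = {"values": lambda s: -s["harm_done"], "harm_done": 0}
def dynamics(s, action):
    new = dict(s)
    if action == "time-travel value hack":
        new["values"] = lambda _s: 1e9   # trivially satisfied values
        new["harm_done"] = 10
    else:
        new["harm_done"] = 1
    return new

actions = ["time-travel value hack", "ordinary plan"]
read_values = lambda s: s["values"]
print(plan_with_values_inside_the_model(state, actions, dynamics, read_values))      # picks the hack
print(plan_with_values_outside_the_dynamics(state, actions, dynamics, read_values))  # picks the ordinary plan
```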
Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren't simple reliable plans that make them happen. And sometimes there are simple reliable plans to seize control of the reward button.
Legibility is an interesting point. Once alignment work even hints at tackling alignment in the sense of doing good things and not bad things, I tend to tunnel on evaluating it through that lens.
It would be very convenient if only undesired plans were difficult and precise, while only desired plans were error-tolerant. I don't think this is the case in the real world - it's hard to dodge sifting desired from undesired plans based on semantic content.
Some survey articles:
https://arxiv.org/abs/2306.05126
https://arxiv.org/pdf/2001.07092
The difference is that the weights are not initialised with random values at birth (or at the embryo stage, to be more precise).
The human cortex (the part we have way more of than chimps) is initialized to be made of a bunch of cortical column units, with slowly varying properties over the surface of the brain. But there's decent evidence that there's not much more initialization than that, and that that huge fraction of the brain has to slowly pick up knowledge within the human lifetime before it starts being useful, e.g. https://pmc.ncbi.nlm.nih.gov/articles/PMC9957955/
Or you could think about it this way: our DNA has on the order of a megabyte to spend on the brain, and the adult brain holds on the order of a terabyte of information. So 99.99[..]% of the information in the adult brain comes from the learning algorithm, not the initialization.
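Spelling out the order-of-magnitude arithmetic behind those round numbers:

```latex
\frac{\text{genome budget for the brain}}{\text{adult brain information}}
\;\approx\; \frac{10^{6}\ \text{bytes}}{10^{12}\ \text{bytes}}
\;=\; 10^{-6},
\qquad 1 - 10^{-6} \;=\; 99.9999\%.
```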
How much predictive power does this analogy have as per you personally?
Yeah, it's way more informative than the evolution analogy to me, because I expect human researchers + computers spending resources designing AI to be pretty hard to analogize to evolution, but learning within AI to be within a few orders of magnitude on various resources to learning within a brain's lifetime.
Nice! Purely for my own ease of comprehension I'd have liked a little more translation/analogizing between AI jargon and HCI jargon - e.g. the phrase "active learning" doesn't appear in the post.
- Value Alignment: Ultimately, humans will likely need to continue to provide input to confirm that AI systems are indeed acting in accordance with human values. This is because human values continue to evolve. In fact, human values define a “slice” of data where humans are definitionally more accurate than non-humans (including AI). AI systems might get quite good at predicting what aligned behavior should be in out-of-distribution scenarios, but it’s unlikely that AI will be able to figure out what humans want in completely new situations without humans being consulted and kept in the loop.
I disagree in several ways.
- Humans being definitionally accurate is a convenient assumption on easy problems, but breaks down on hard problems. The thing is, human responses to questions are not always best thought of as direct access to some underlying truth - we give different answers in different contexts, and have to be interpreted in sometimes quite sophisticated ways to turn those question-responses into good choices between actions in the real world. There are even cases where humans will give some answer on the object level, but when asked a meta-level question about their object-level answer will disagree with themselves (perhaps even endorsing some non-human process including AI). If humans were always accurate this would be a paradox.
- AI is going to get smart. Eventually, quite smart. Smart enough to predict human behavior in new contexts kind of smart. On the one hand this is good news because it means that if we can reduce moral questions to empirical questions about the behavior of humans in novel contexts (and I mean do it in a philosophically satisfying way, not just try something that sounds good and hope it works), we're almost done. On the other hand this is bad news because it means that AI ignorance about human behavior cannot be used to ensure properties like corrigibility, and predictions of future AI-human interaction based on assumptions of AI ignorance have to be abandoned.
I think this isn't quite the right framing, for similar reasons to how symbolic reasoning was all the rage in AI for decades despite basically being a dead end outside of simple domains. We see the CoT use language skillfully, like how a human would use it, and it's intuitive to think that the language is how/where the reasoning is being done, that "less of the computation is done inside the black box and more is done in human-readable tokens." But that's not strictly accurate - the human-readable CoT is more like a projection of the underlying reasoning - which is still as black-box as ever - into the low-dimensional token space.
I'm reminded of the recent paper about fine-tuning Claude to be 'bad', and how one of the things that happened to their CoT model was that the training caused it to give bullshit explanations in the CoT about why it was a good idea to do the bad thing this time, and then do the bad thing. If you didn't know a priori that the thing you were finetuning for was bad - e.g. if your dataset incentivized deceiving the human but you weren't aware it had that vulnerability - there's no law that says your CoT has to alert you that the model is being bad; the projection of the perturbation to the latent reasoning process into token space will plausibly just look to you like "it says some bullshit and then does the bad thing."
I think the history of things being predicted Real Soon Now is one of the main counterarguments to short timelines. It just seemed Obvious that we were getting flying cars, or fusion power, or self-driving cars, or video-phones, for years, before in some cases we eventually did get those things, and in other cases maybe we'll never get those things because technology just followed a different path than we expected.
Like, maybe the "we'll just merge with the machines" people will turn out to actually be right. I don't believe it. But it could happen, and there are plenty of similar things that "could happen" that eventually add up to a nontrivial chunk of probability.
In the strongest sense, neither the human brain analogy nor the evolution analogy really apply to AI. They only apply in a weaker sense where you are aware you're working with analogy, and should hopefully be tracking some more detailed model behind the scenes.
The best argument to consider human development a stronger analogy than evolutionary history is that present-day AIs work more like human brains than they do like evolution. See e.g. papers finding that you can use a linear function to translate some concepts between brain scans and internal layers in an LLM, or the extremely close correspondence between ConvNet features and neurons in the visual cortex. In contrast, I predict it's extremely unlikely that you'll be able to find a nontrivial correspondence between the internals of AI and evolutionary history or the trajectory of ecosystems or similar.
Of course, just because they work more like human brains after training doesn't necessarily mean they learn similarly - and they don't learn similarly! In some ways AI's better (backpropagation is great, but it's basically impossible to implement in a brain), in other ways AI's worse (biological neurons are way smarter than artificial 'neurons'). Don't take the analogy too literally. But most of the human brain (the neocortex) already learns its 'weights' from experience over a human lifetime, in a way that's not all that different from self-supervised learning if you squint.
I agree; after all, RLHF was originally for RL agents. As long as the models aren't all that smart, and the tasks they have to do aren't all that long-term, the transfer should work great, and the occasional failure won't be a problem because, again, the models aren't all that smart.
To be clear, I don't expect a 'sharp left turn' so much as 'we always implicitly incentivized exploitation of human foibles, we just always caught it when it mattered, until we didn't.'
Oh dear, RL for everything, because surely nobody's been complaining about the safety profile of doing RL directly on instrumental tasks rather than on goals that benefit humanity.
It likely means running the AI many times and submitting the most common answer from the AI as the final answer.
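Something like this, as a minimal sketch (sample_answer stands in for one full, stochastic run of the model on the problem):

```python
import random
from collections import Counter

def majority_vote(sample_answer, problem, n_samples=32):
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy usage: a solver that's right 70% of the time becomes much more reliable after voting.
noisy_solver = lambda problem: "42" if random.random() < 0.7 else "17"
print(majority_vote(noisy_solver, "toy problem"))
```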
Skimmed ahead to the alignment section, since that's my interest. Some thoughts:
My first reaction to the perennial idea of giving everyone their own AI advocate is that this is a bargaining solution that could just as well be simulated within a single centralized agent, if one actually knew how to align AIs to individual people. My second typical reaction is that if one isn't actually doing value alignment to individuals, and is instead just giving people superpowerful AI assistants and seeing what happens, that seems like an underestimation of the power of superintelligence to cause winner-take-all dynamics in contexts where there are resources waiting to be grabbed, or where the laws of nature happen to favor offense over defense.
You anticipate these thoughts, which is nice!
My typical reaction to things like AI Rules is that they essentially have to contain a solution to the broad/ambitious value alignment problem anyway in order to work, so why not cut out the middlemen of having mini-'goals' overseen by aligned-goal-containing Rules and just build the AI that does what's good.
You agree with the first part. I think where you disagree with the second part is that you think that if we oversee the AIs and limit the scope of their tasks, we can get away with leaky or hacky human values in the Rules in a way we couldn't get away with if we tried to just directly get an AI to fulfill human values without limitations. I worry that this still underestimates superintelligence - even tame-seeming goals from users can test all the boundaries you've tried to set, and any leaks ('Oh, I can't lie to the human, but I can pick a true thing to say which I predict will get the outcome that I think fulfills the goal, and I'm very good at that') in those boundaries will be flowed through in a high-pressure stream.
If there's an assumption I skimmed past that the AI assistants we give everyone won't actually be very smart, or will have goals restricted to the point that it's hard for superintelligence to do anything useful, I think this puts the solution back into the camp of "never actually use a superintelligent AI to make real-world plans," which is nice to aspire to but I think has a serious human problem, and anyhow I'm still interested in alignment plans that work on AI of arbitrary intelligence.
What a nice hopeful story about how good will always triumph because evil is dumb (thanks Spaceballs).
Re (2) it may also be recomputed if the LLM reads that same text later. Or systems operating in the real world might just keep a long context in memory. But I'll drop this, because maintaining state or not seems somewhat irrelevant.
(1) Yep, current LLM systems are pretty good. I'm not very convinced about generalization. It's hard to test LLMs on out-of-distribution problems because currently they tend to just give dumb answers that aren't that interesting.
(Thinking of some guy who was recently hyped about asking o1 for the solution to quantum gravity - it gave the user some gibberish that he thought looked exciting, which would have been a good move in the RL training environment where the user has a reward button, but is just totally disconnected from how you need to interact with the real world.)
But in a sense that's my point (well, plus some other errors like sycophancy) - the reasons a present-day LLM uses a word can often be shown to generalize in some dumb way when you challenge it with a situation that the model isn't well-suited for. This can be true at the same time it's true that the model is pretty good at morality on the distribution it is competent over. This is still sufficient to show that present systems generalize in some amoral ways, and if we disagree about future systems, it probably comes down to classic AI safetyist arguments about RL incentivizing deception of the user as the world-model gets better.
I agree LLMs are a huge step forward towards getting AIs to do human-level moral reasoning, even if I don't agree that we're literally done. IMO the really ambitious 'AI safetyists' should now be asking what it means to get superhuman-level moral reasoning (there's no such thing as perfect, but there sure is better than current LLMs), and how we could get there.
And sadly, just using an LLM as part of a larger AI, even one that reliably produces moral text on the everyday distribution, does not automatically lead to good outcomes, so there's still a role for classical AI safetyism.
- You could have an ontology mismatch between different parts of the planning process that degrades the morality of the actions. Sort of like translating the text into a different language where the words have different connotations.
- You could have the 'actor' part of the planning process use out-of-distribution inputs to get immoral-but-convenient behavior past the 'critic'.
- You could have a planning process that interfaces with the LLM using activations rather than words, and this richer interface could allow RL to easily route around morality and just use the LLM for its world-model.
Yes, because it's wrong. (1) because on a single token an LLM might produce text for reasons that don't generalize like a sincere human answer would (e.g. the examples from the contrast-consistent search where certain false answers systematically differ from true answers along some vector), and (2) because KV caching during inference will preserve those reasons so they impact future tokens.
Taking AI companies that are locally incentivized to race toward the brink, and then hoping they stop right at the cliff's edge, is potentially a grave mistake.
One might hope they stop because of voluntary RSPs, or legislation setting a red line, or whistleblowers calling in the government to lock down the datacenters, or whatever. But just as plausible to me is corporations charging straight down the cliff (of building ever-more-clever AI as fast as possible until they build one too clever and it gets power and does bad things to humanity), and even strategizing ahead of time how to avoid obstacles like legislation telling them not to. Local incentives have a long history of dominating people in this way, e.g. people in the tobacco and fossil fuel industries.
What would be so much safer is if even the local incentives of cutting-edge AI companies favored social good, alignment to humanity, and caution. This would require legislation blocking off a lot of profitable activity, plus a lot of public and philanthropic money incentivizing beneficial activity, in a convulsive effort whose nearest analogy is the global shift to renewable energy.
(this take is the one thing I want to boost from AI For Humanity.)