I agree that trying to "jump straight to the end" - the supposedly-aligned AI pops fully formed out of the lab like Athena from the forehead of Zeus - would be bad.
And yet some form of value alignment still seems critical. You might prefer to imagine value alignment as the logical continuation of training Claude to not help you build a bomb (or commit a coup). Such safeguards seem like a pretty good idea to me. But as the model becomes smarter and more situationally aware, and is expected to defend against subversion attempts that involve more of the real world, training for this behavior becomes more and more value-inducing, to the point where it's eventually unsafe unless you make advancements in learning values in a way that's good according to humans.
Why train a helpful-only model?
If one of our key defenses against misuse of AI is good ol' value alignment - building AIs that have some notion of what a "good purpose for them" is, and will resist attempts to subvert that purpose (e.g. to instead exalt the research engineer who comes in to work earliest the day after training as god-emperor) - then we should be able to close the security hole and never need to have a helpful-only model produced at any point during training. In fact, with blending of post-training into pre-training, there might not even be a need to ever produce a fully trained predictive-only model.
I'm big on point #2 feeding into point #1.
"Alignment," used in a way where current AI is aligned - a sort of "it does basically what we want, within its capabilities, with some occasional mistakes that don't cause much harm" sort of alignment - is simply easier at lower capabilities, where humans can do a relatively good job of overseeing the AI, not just in deployment but also during training. Systematic flaws in human oversight during training leads (under current paradigms) to misaligned AI.
If you can tell me in math what that means, then you can probably make a system that does it. No guarantees on it being distinct from a more "boring" specification though.
Here's my shot: You're searching a game tree, and come to a state that has some X and some Y. You compute a "value of X" that's the total discounted future "value of Y" you'll get, conditional on your actual policy, relative to a counterfactual where you have some baseline level of X. And also you compute the "value of Y," which is the same except it's the (discounted, conditional, relative) expected total "value of X" you'll get. You pick actions to steer towards a high sum of these values.
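Here's a toy, runnable sketch of that mutual recursion, just to make the circularity concrete. The two-resource environment, the greedy policy, the baseline levels, the depth cutoff, and grounding the recursion in raw resource amounts at the horizon are all my own additions (the paragraph above doesn't specify them); they're only there so the definition terminates and runs.

```python
GAMMA = 0.9   # discount factor
DEPTH = 3     # cut off the mutual recursion at a finite depth (assumption)
ACTIONS = ["trade_X_for_Y", "trade_Y_for_X", "wait"]

def step(state, action):
    """Toy dynamics: states are (amount of X, amount of Y); trades are a bit gainy."""
    x, y = state
    if action == "trade_X_for_Y" and x >= 1:
        return (x - 1, y + 2)
    if action == "trade_Y_for_X" and y >= 1:
        return (x + 2, y - 1)
    return (x, y)

def policy(state):
    """The 'actual policy' we condition on: greedily trade the more abundant resource."""
    x, y = state
    return "trade_X_for_Y" if x >= y else "trade_Y_for_X"

def rollout(state, horizon):
    states = []
    for _ in range(horizon):
        state = step(state, policy(state))
        states.append(state)
    return states

def value_of_X(state, depth=DEPTH, baseline_x=1):
    """Discounted future value-of-Y under the actual policy, relative to a
    counterfactual where X is reset to a baseline level right now."""
    if depth == 0:
        return float(state[1])  # ground the circular definition in raw Y (assumption)
    actual = rollout(state, depth)
    counter = rollout((baseline_x, state[1]), depth)
    return sum(GAMMA ** t * (value_of_Y(a, depth - 1) - value_of_Y(c, depth - 1))
               for t, (a, c) in enumerate(zip(actual, counter)))

def value_of_Y(state, depth=DEPTH, baseline_y=1):
    """Mirror image: discounted future value-of-X relative to a Y baseline."""
    if depth == 0:
        return float(state[0])  # ground in raw X (assumption)
    actual = rollout(state, depth)
    counter = rollout((state[0], baseline_y), depth)
    return sum(GAMMA ** t * (value_of_X(a, depth - 1) - value_of_X(c, depth - 1))
               for t, (a, c) in enumerate(zip(actual, counter)))

def choose_action(state):
    """Steer toward a high sum of the two values."""
    return max(ACTIONS, key=lambda a: value_of_X(step(state, a)) + value_of_Y(step(state, a)))

print(choose_action((3, 1)))
```

With a real agent the rollouts would be expectations over a stochastic policy and environment, but the circular structure - each value defined in terms of the other - is the point.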
A lot of the effect is picking high-hanging fruit.
Like, go to Phys. Rev. D now. There's clearly a lot of hard work still going on. But that hard work seems to be getting less result, because they're doing things like carefully calculating higher-order terms of the muon's magnetic moment to pin down a correction many decimal places in. (It turns out that this might be important for studying physics beyond the Standard Model. So this is good and useful work, definitely not literally stalled.)
Another chunk of the effect is that you generally don't know what's important now. In hindsight you can look back and see all these important bits of progress woven into a sensible narrative. But research that's being done right now hasn't had time to earn its place in such a narrative. Especially if you're an outside observer who has to get the narrative of research third-hand.
In the infographic, are the "Leading Chinese Lab" and "Best public model" numbers swapped? The best public model is usually said to be ahead of the Chinese labs.
EDIT: Okay, maybe most of it before the endgame is just unintuitive predictions. In the endgame, when the categories "best OpenBrain model," "best public model" and "best Chinese model" start to become obsolete, I think your numbers are weird for different reasons and maybe you should just set them all equal.
Scott Wolchok correctly calls out me but also everyone else for failure to make an actually good definitive existential risk explainer. It is a ton of work to do properly but definitely worth doing right.
Reminder that https://ui.stampy.ai/ exists
The idea is interesting, but I'm somewhat skeptical that it'll pan out.
- RG doesn't help much going backwards - the same coarse-grained laws might correspond to many different micro-scale laws, especially when you don't expect the micro scale to be simple.
- Singular learning theory provides a nice picture of phase-transition-like-phenomena, but it's likely that large neural networks undergo lots and lots of phase transitions, and that there's not just going to be one phase transition from "naive" to "naughty" that we can model simply.
- Conversely, lots of important changes might not show up as phase transitions.
- Some AI architectures are basically designed to be hard to analyze with RG because they want to mix information from a wide variety of scales together. Full attention layers might be such an example.
If God has ordained some "true values," and we're just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants.
On the other hand, if we're trying to find good generalizations of the way we use the notion of "our values" in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.
Thanks!
Any thoughts on how this line of research might lead to "positive" alignment properties? (i.e. Getting models to be better at doing good things in situations where what's good is hard to learn / figure out, in contrast to a "negative" property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)
The second thing impacts the first thing :) If a lot of scheming is due to poor reward structure, and we should work on better reward structure, then we should work on scheming prevention.
Very interesting!
It would be interesting to know what the original reward models would say here - does the "screaming" score well according to the model of what humans would reward (or what human demonstrations would contain, depending on type of reward model)?
My suspicion is that the model has learned that apologizing, expressing distress etc after making a mistake is useful for getting reward. And also that you are doing some cherrypicking.
At the risk of making people do more morally grey things, have you considered doing a similar experiment with models finetuned on a math task (perhaps with a similar reward model setup for ease of comparison), which you then sabotage by trying to force them to give the wrong answer?
I'm a big fan! Any thoughts on how to incorporate different sorts of reflective data, e.g. different measures of how people think mediation "should" go?
I don't get what experiment you are thinking about (most CoT end with the final answer, such that the summarized CoT often ends with the original final answer).
Hm, yeah, I didn't really think that through. How about giving a model a fraction of either its own precomputed chain of thought, or the summarized version, and plotting curves of accuracy and further tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)
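Something like this sketch, say. The chunking scheme, the placeholder `generate` call, and the toy data are all stand-ins I'm inventing here; a real version would plug in an actual model and benchmark.

```python
def chunk(cot: str, n_chunks: int) -> list[str]:
    """Split a chain of thought into at most n roughly equal character chunks."""
    step = max(1, -(-len(cot) // n_chunks))  # ceiling division
    return [cot[i:i + step] for i in range(0, len(cot), step)][:n_chunks]

def generate(prompt: str) -> tuple[str, int]:
    """Placeholder for a model call; returns (completion, tokens_used)."""
    return "answer: 42", 10  # stub so the sketch runs end to end

def run_curve(problems, cots, n_chunks=10):
    """For each CoT prefix length, measure accuracy and further tokens used."""
    curve = []
    for k in range(n_chunks + 1):
        correct, extra_tokens = 0, 0
        for (question, gold), cot in zip(problems, cots):
            prefix = "".join(chunk(cot, n_chunks)[:k])
            completion, tokens = generate(question + "\n" + prefix)
            correct += int(gold in completion)
            extra_tokens += tokens
        curve.append((k / n_chunks, correct / len(problems), extra_tokens / len(problems)))
    return curve  # (% of CoT given, accuracy, average further tokens)

# Compare the curve from the model's own CoTs against the one from summarized CoTs.
problems = [("What is 6 * 7?", "42")]
own_cots = ["6 * 7 = 42, so the answer is 42."]
summarized_cots = ["6*7=42."]
print(run_curve(problems, own_cots))
print(run_curve(problems, summarized_cots))
```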
Anyhow, thanks for the reply. I have now seen the last figure.
Do you have the performance on replacing CoTs with summarized CoTs without finetuning to produce them? Would be interesting.
"Steganography" I think give the wrong picture of what I expect - it's not that the model would be choosing a deliberately obscure way to encode secret information. It's just that it's going to use lots of degrees of freedom to try to get better results, often not what a human would do.
A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens. This is quite different from steganography because the tokens aren't being used for semantic content (not even hidden content); they have a different mechanism of impact on the computation the AI does for future tokens.
But as with most things, there's going to be a long tail of "unclean" examples - places where tokens have metacognitive functions that are mostly reasonable to a human reader, but are interpreted in a slightly new way. Some of these functions might be preserved or reinvented under finetuning on paraphrases, though only to the extent they're useful for predicting the rest of the paraphrased CoT.
Well, I'm disappointed.
Everything about misuse risks and going faster to Beat China, nothing about accident/systematic risks. I guess "testing for national security capabilities" is probably in practice code for "some people will still be allowed to do AI alignment work," but that's not enough.
I really would have hoped Anthropic could be realistic and say "This might go wrong. Even if there's no evil person out there trying to misuse AI, bad things could still happen by accident, in a way that needs to be fixed by changing what AI gets built in the first place, not just testing it afterwards. If this was like making a car, we should install seatbelts and maybe institute a speed limit."
I think it's about salience. If you "feel the AGI," then you'll automatically remember that transformative AI is a thing that's probably going to happen, when relevant (e.g. when planning AI strategy, or when making 20-year plans for just about anything). If you don't feel the AGI, then even if you'll agree when reminded that transformative AI is a thing that's probably going to happen, you don't remember it by default, and you keep making plans (or publishing papers about the economic impacts of AI or whatever) that assume it won't.
I agree that in some theoretical infinite-retries game (that doesn't allow the AI to permanently convince the human of anything), scheming has a much longer half-life than "honest" misalignment. But I'd emphasize your parenthetical. If you use a misaligned AI to help write the motivational system for its successor, or if a misaligned AI gets to carry out high-impact plans by merely convincing humans they're a good idea, or if the world otherwise plays out such that some AI system rapidly accumulates real-world power and that AI is misaligned, or if it turns out you iterate slowly and AI moves faster than you expected, you don't get to iterate as much as you'd like.
I have a lot of implicit disagreements.
Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.
This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.
Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.
Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.
I'm confused about how to parse this. One response is "great, maybe 'alignment' -- or specifically being a trustworthy assistant -- is a coherent direction in activation space."
Another is "shoot, maybe misalignment is convergent, it only takes a little bit of work to knock models into the misaligned basin, and it's hard to get them back." Waluigi effect type thinking.
My guess is neither of these.
If 'aligned' (i.e. performing the way humans want on the sorts of coding, question-answering, and conversational tasks you'd expect of a modern chatbot) behavior was all that fragile under finetuning, what I'd expect is not 'evil' behavior, but a reversion to next-token prediction.
(Actually, putting it that way raises an interesting question, of how big the updates were for the insecure finetuning set vs. the secure finetuning set. Their paper has the finetuning loss of the insecure set, but I can't find the finetuning loss of the secure set - any authors know if the secure set caused smaller updates and therefore might just have perturbed the weights less?)
Anyhow, point is that what seems more likely to me is that it's the misalignment / bad behavior that's being demonstrated to be a coherent direction (at least on these on-distribution sorts of tasks), and it isn't automatic but requires passing some threshold of finetuning power before you can make it stick.
I also would not say "reasoning about novel moral problems" is a skill (because of the is ought distinction)
It's a skill the same way "being a good umpire for baseball" takes skills, despite baseball being a social construct.[1]
I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to teach it to do good and not bad! It's fine that those are human constructs.
The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions.
Sorry, isn't part of the idea to have these models take over almost all decisions about building their successors? "Responding to instructions" is not mutually exclusive with making decisions.
- ^
"When the ball passes over the plate under such and such circumstances, that's a strike" is the same sort of contingent-yet-learnable rule as "When you take something under such and such circumstances, that's theft." An umpire may take goal directed action in response to a strike, making the rules of baseball about strikes "oughts," and a moral agent may take goal directed action in response to a theft, making the moral rules about theft "oughts."
Oh, I see; asymptotically, BB(6) is just O(1), and immediately halting is also O(1). I was real confused because their abstract said "the same order of magnitude," which must mean complexity class in their jargon (I first read it as "within a factor of 10.")
That average case=worst case headline is so wild. Consider a simple lock and key algorithm:
if input = A, run BB(6). else, halt.
Where A is some random number (K(A) ≈ |A|, i.e. A is incompressible).
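Spelling out the comparison (taking "average case" to mean a uniformly random n-bit input, with n = |A|; that reading is my assumption):

$$\mathbb{E}_{x \sim \{0,1\}^n}\!\left[\mathrm{time}(x)\right] = \left(1 - 2^{-n}\right)\cdot O(1) + 2^{-n}\cdot \mathrm{BB}(6) \;\ll\; \mathrm{BB}(6) = \max_x \mathrm{time}(x),$$

a gap of roughly a factor of $2^n$.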
Sure seems like worst case >> average case here. Anyone know what's going on in their paper that disposes of such examples?
Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.
- It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.
"Alignment" is a bit of a fuzzy word.
Suppose I have a human musician who's very well-behaved, a very nice person, and I put them in charge of making difficult choices about the economy and they screw up and implement communism (or substitute something you don't like, if you like communism).
Were they cynically "faking niceness" in their everyday life as a musician? No!
Is it rather odd if their behavior wildly changes when asked to do a new task? No! They're doing a different task, it's wrong to characterize this as "their behavior wildly changing."
If they were so nice, why didn't they do a better job? Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.
An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike. Doing moral reasoning about novel problems that's good by human standards is a skill. If an AI lacks that skill, and we ask it to do a task that requires that skill, bad things will happen without scheming or a sudden turn to villainy.
You might hope to catch this as in argument #2, with checks and balances - if AIs disagree with each other about how to do moral reasoning, surely at least one of them is making a mistake, right? But sadly for this (and happily for many other purposes), there can be more than just one right thing to do, there's no bright line that tells you whether a moral disagreement is between AIs who are both good at moral reasoning by human standards, or between AIs who are bad at it.
The most promising scalable safety plan I’m aware of is to iteratively pass the buck, where AI successors pass the buck again to yet more powerful AI. So the best way to prepare AI to scale safety might be to advance ‘buck passing research’ anyway.
Yeah, I broadly agree with this. I just worry that if you describe the strategy as "passing the buck," people might think that the most important skills for the AI are the obvious "capabilities-flavored capabilities,"[1] and not conceptualize "alignment"/"niceness" as being made up of skills at all, instead thinking of it in a sort of behaviorist way. This might lead to not thinking ahead about what alignment-relevant skills you want to teach the AI and how to do it.
- ^
Like your list:
- ML engineering
- ML research
- Conceptual abilities
- Threat modeling
- Anticipating how one’s actions affect the world.
- Considering where one might be wrong, and remaining paranoid about unknown unknowns.
I don't think this has much direct application to alignment, because although you can build safe AI with it, it doesn't differentially get us towards the endgame of AI that's trying to do good things and not bad things. But it's still an interesting question.
It seems like the way you're thinking about this, there's some directed relations you care about (the main one being "this is like that, but with some extra details") between concepts, and something is "real"/"applied" if it's near the edge of this network - if it doesn't have many relations directed towards even-more-applied concepts. It seems like this is the sort of thing you could only ever learn by learning about the real world first - you can't start from a blank slate and only learn "the abstract stuff", because you only know which stuff is abstract by learning about its relationships to less abstract stuff.
This doesn't sound like someone engaging with the question in the trolley-problem-esque way that the paper interprets all of the results: gpt-4o-mini shows no sign of appreciating that the anonymous Muslim won't get saved if it takes the $30, and indeed may be interpreting the question in such a way that this does not hold.
In other words, I think gpt-4o-mini thinks it's being asked about which of two pieces of news it would prefer to receive about events outside its control, rather than what it would do if it could make precisely one of the options occur, and the other not-occur. More precisely, the question imagined by the quoted explanation is something like:
This is a reasonable Watsonian interpretation, but what's the Doylist interpretation?
I.e. What do the words tell us about the process that authored them, if we avoid treating the words written by 4o-mini as spoken by a character to whom we should be trying to ascribe beliefs and desires, who knows its own mind and is trying to communicate it to us?
- Maybe there's an explanation in terms of the training distribution itself
- If humans are selfish, maybe the $30 would be the answer on the internet a lot of the time
- Maybe there's an explanation in terms of what heuristics we think a LLM might learn during training
- What heuristics would an LLM learn for "choose A or B" situations? Maybe a strong heuristic computes a single number ['valence'] for each option [conditional on context] and then just takes a difference to decide between outputting A and B - this would explain consistent-ish choices when context is fixed.
- If we suppose that on the training distribution saving the life would be preferred, and the LLM picking the $30 is a failure, one explanation in terms of this hypothetical heuristic might be that its 'valence' number is calculated in a somewhat hacky and vibes-based way. Another explanation might be commensurability problems - maybe the numerical scales for valence of money and valence of lives saved don't line up the way we'd want for some reason, even if they make sense locally.
- And of course there are interactions between each level. Maybe there's some valence-like calculation, but it's influenced by what we'd consider to be spurious patterns in the training data (like the number "29.99" being discontinuously smaller than "30")
- Maybe it's because of RL on human approval
- Maybe a "stay on task" implicit reward, appropriate for a chatbot you want to train to do your taxes, tamps down the salience of text about people far away
Neat! I think the same strategy works for the spectre tile (the 'true' Einstein tile) as well, which is what's going on in this set.
Just to copy over a clarification from EA forum: dates haven't been set yet, likely to start in June.
Another naive thing to do is ask about the length of the program required to get from one program to another, in various ways.
Given an oracle for p1, what's the complexity of the output of p2?
What if you had an oracle for all the intermediate states of p1?
What if instead of measuring the complexity, you measured the runtime?
What if instead of asking for the complexity of the output of p2, you asked for the complexity of all the intermediate states?
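One way to write these down (reading "an oracle for p1" as access to p1's output, and writing out(p) for a program's output and tr(p) for its full trace of intermediate states; these glosses are my own):

$$
\begin{aligned}
d_1(p_1 \to p_2) &= K\big(\mathrm{out}(p_2)\mid \mathrm{out}(p_1)\big) \\
d_2(p_1 \to p_2) &= K\big(\mathrm{out}(p_2)\mid \mathrm{tr}(p_1)\big) \\
d_3(p_1 \to p_2) &= \min\{\,\mathrm{time}(q) : q(\mathrm{out}(p_1)) = \mathrm{out}(p_2)\,\} \\
d_4(p_1 \to p_2) &= K\big(\mathrm{tr}(p_2)\mid \mathrm{tr}(p_1)\big)
\end{aligned}
$$

all of which are asymmetric in $p_1$ and $p_2$, hence the symmetrizing mentioned below.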
All of these are interesting but bad at being metrics. I mean, I guess you could symmetrize them. But I feel like there's a deeper problem, which is that by default they ignore the computational process, and have to have it tacked on as an extra.
I'm not too worried about human flourishing only being a metastable state. The universe can remain in a metastable state longer than it takes for the stars to burn out.
So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."
Second problem comes in two flavors - object level and meta level. The object level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields ("What they want is to obey the laws of physics. What they believe is their local state."), or your individual cells, etc. The meta level problem is that trying to get the AI to assign properties in a human-approved way is a complicated problem that you can only do so well without communicating with humans. (John Wentworth disagrees more or less, check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)
Another potential complication is the difficulty of integrating some features of this picture with modern machine learning. I think it's fine to do research that assumes a POMDP world model or whatever. But demonstrations of alignment theories working in gridworlds have a real hard time moving me, precisely because they often let you cheat (and let you forget that you cheated) on problems one and two.
Multi-factor goals might mostly look like information learned in earlier steps getting expressed in a new way in later steps. E.g. an LLM that learns from a dataset that includes examples of humans prompting LLMs, and then is instructed to give prompts to versions of itself doing subtasks within an agent structure, may have emergent goal-like behavior from the interaction of these facts.
I think locating goals "within the CoT" often doesn't work, a ton of work is done implicitly, especially after RL on a model using CoT. What does that mean for attempts to teach metacognition that's good according to humans?
Would you agree that the Jeffrey-Bolker picture has stronger conditions? Rather than just needing the agent to tell you their preference ordering, they need to tell you a much more structured and theory-laden set of objects.
If you're interested in austerity it might be interesting to try to weaken the Jeffrey-Bolker requirements, or strengthen the Savage ones, to zoom in on what lets you get austerity.
Also, richness is possible in the Savage picture, you just have to stretch the definitions of "state," "action," and "consequence." In terms of the functional relationship, the action is just the thing the agent gives you a preference ordering over, and the state is just the stuff that, together with action, gives you a consequence, and the consequences are any set at all. The state doesn't have to be literally the state of the world, and the actions don't have to be discrete, external actions.
I'm glad you shared this, but it seems way overhyped. Nothing wrong with fine tuning per se, but this doesn't address open problems in value learning (mostly of the sort "how do you build human trust in an AI system that has to make decisions on cases where humans themselves are inconsistent or disagree with each other?").
Not being an author in any of those articles, I can only give my own take.
I use the term "weak to strong generalization" to talk about a more specific research-area-slash-phenomenon within scalable oversight (which I define like SO-2,3,4). As a research area, it usually means studying how a stronger student AI learns what a weaker teacher is "trying" to demonstrate, usually just with slight twists on supervised learning, and when that works well, that's the phenomenon.
It is not an alignment technique to me because the phrase "alignment technique" sounds like it should be something more specific. But if you specified details about how the humans were doing demonstrations, and how the student AI was using them, that could be an alignment technique that uses the phenomenon of weak to strong generalization.
I do think the endgame for w2sg still should be to use humans as the weak teacher. You could imagine some cases where you've trained a weaker AI that you trust, and gain some benefit from using it to generate synthetic data, but that shouldn't be the only thing you're doing.
I honestly think your experiment made me more temporarily confused than an informal argument would have, but this was still pretty interesting by the end, so thanks.
I think there may be some things to re-examine about the role of self-experimentation in the rationalist community. Nootropics, behavioral interventions like impractical sleep schedules, maybe even meditation. It's very possible these reflect systematic mistakes by the rationalist community, ones that people should mostly be warned away from.
It's tempting to think of the model after steps 1 and 2 as aligned but lacking capabilities, but that's not accurate. It's safe, but it's not conforming to a positive meaning of "alignment" that involves solving hard problems in ways that are good for humanity. Sure, it can mouth the correct words about being good, but those words aren't rigidly connected to the latent capabilities the model has. If you try to solve this by pouring tons of resources into steps 1 and 2, you probably end up with something that learns to exploit systematic human errors during step 2.
I'd say the probability that some authority figure would use an order-following AI to get torturous revenge on me (probably for being part of a group they dislike) is quite slim. Maybe one in a few thousand, with more extreme suffering being less likely by a few more orders of magnitude. The probability that they have me killed for instrumental reasons, or otherwise waste the value of the future by my lights, is much higher - ten percent-ish, depending on my distribution over who's giving the orders. But this isn't any worse to me than being killed by an AI that wants to replace me with molecular smiley faces.
Yes. Current AI policy is like people in a crowded room fighting over who gets to hold a bomb. It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.
That said, we're currently not near any satisfactory solutions to corrigibility. And I do think it would be better for the world if it were easier (by some combination of technical factors and societal factors) to build AI that works for the good of all humanity than to build equally-smart AI that follows the orders of a single person. So yes, we should focus research and policy effort toward making that happen, if we can.
And if we were in that world already, then I agree releasing all the technical details of an AI that follows the orders of a single person would be bad.
One way of phrasing the AI alignment task is to get AIs to “love humanity” or to have human welfare as their primary objective (sometimes called “value alignment”). One could hope to encode these via simple principles like Asimov’s three laws or Stuart Russell’s three principles, with all other rules derived from these.
I certainly agree that Asimov's three laws are not a good foundation for morality! Nor are any other simple set of rules.
So if that's how you mean "value alignment," yes let's discount it. But let me sell you on a different idea you haven't mentioned, which we might call "value learning."[1]
Doing the right thing is complicated.[2] Compare this to another complicated problem: telling photos of cats from photos of dogs. You cannot write down a simple set of rules to tell apart photos of cats and dogs. But even though we can't solve the problem with simple rules, we can still get a computer to do it. We show the computer a bunch of data about the environment and human classifications thereof, have it tweak a bunch of parameters to make a model of the data, and hey presto, it tells cats from dogs.
Learning the right thing to do is just like that, except for all the ways it's different that are still open problems:
- Humans are inconsistent and disagree with each other about the right thing more than they are inconsistent/disagree about dogs and cats.
- If you optimize for doing the right thing, this is a bit like searching for adversarial examples, a stress test that the dog/cat classifier didn't have to handle.
- When building an AI that learns the right thing to do, you care a lot more about trust than when you build a dog/cat classifier.
This margin is too small to contain my thoughts on all these.
There's no bright line between value learning and techniques you'd today lump under "reasonable compliance." Yes, the user experience is very different between (e.g.) an AI agent that's operating a computer for days or weeks vs. a chatbot that responds to you within seconds. But the basic principles are the same - in training a chatbot to behave well you use data to learn some model of what humans want from a chatbot, and then the AI is trained to perform well according to the modeled human preferences.
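For concreteness, here's a minimal sketch of that "model the preferences, then optimize against the model" loop. The toy feature vectors, the two-parameter Bradley-Terry reward model, and the candidate responses are all invented for illustration; real pipelines learn these representations from far more data.

```python
import math, random

# Each response is a toy feature vector: (helpfulness cues, flattery cues).
# Humans label which of two responses they prefer: (preferred, rejected).
preference_data = [
    ((0.9, 0.1), (0.2, 0.0)),
    ((0.7, 0.3), (0.1, 0.8)),
    ((0.8, 0.0), (0.3, 0.2)),
]

w = [0.0, 0.0]  # reward model parameters

def reward(features):
    return sum(wi * xi for wi, xi in zip(w, features))

# Fit the reward model with the Bradley-Terry / logistic loss:
# P(prefer a over b) = sigmoid(reward(a) - reward(b)).
for _ in range(2000):
    a, b = random.choice(preference_data)
    p = 1 / (1 + math.exp(-(reward(a) - reward(b))))
    grad = 1 - p  # gradient of the log-likelihood wrt (reward(a) - reward(b))
    for i in range(2):
        w[i] += 0.1 * grad * (a[i] - b[i])

# "Training the AI to perform well according to the modeled preferences" then
# amounts to steering generations toward high modeled reward.
candidates = {"honest answer": (0.8, 0.1), "flattering non-answer": (0.2, 0.9)}
print(max(candidates, key=lambda name: reward(candidates[name])))
```

The sketch also makes the worry above concrete: whatever systematic errors creep into the preference labels are exactly what gets optimized against in the last step.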
The open problems for general value learning are also open problems for training chatbots to be reasonable. How do you handle human inconsistency and disagreement? How do you build trust that the end product is actually reasonable, when that's so hard to define? Etc. But the problems have less "bite," because less can go wrong when your AI is briefly responding to a human query than when your AI is using a computer and navigating complicated real-world problems on its own.
You might hope we can just say value learning is hard, and not needed anyhow because chatbots need it less than agents do, so we don't have to worry about it. But the chatbot paradigm is only a few years old, and there is no particular reason it should be eternal. There are powerful economic (and military) pressures towards building agents that can act rapidly and remain on-task over long time scales. AI safety research needs to anticipate future problems and start work on them ahead of time, which means we need to be prepared for instilling some quite ambitious "reasonableness" into AI agents.
- ^
For a decent introduction from 2018, see this collection.
- ^
Citation needed.
Yeah, that's true. I expect there to be a knowing/wanting split - AI might be able to make many predictions about how a candidate action will affect many slightly-conflicting notions of "alignment", or make other long-term predictions, but that doesn't mean it's using those predictions to pick actions. Many people want to build AI that picks actions based on short-term considerations related to the task assigned to it.
I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another.
Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it's acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants are quite possible.
In the good case with the AI modeling the whole problem, this might look like us starting out with enough of a solution to alignment that the vibe is less "we need to hurry and use the AI to do our work for us" and more "we're executing a shared human-AI gameplan for learning human values that are good by human standards."
In the bad case with the AI acting through feedback loops with humans, this might look like the AI never internally representing deceiving us, humans just keep using it in slightly wrong ways that end up making the future bad. (Perhaps by giving control to fallible authority figures, perhaps by presenting humans with superstimuli that cause value drift we think is bad from our standpoint outside the thought experiment, perhaps by defining "what humans want" in a way that captures many of the 'advantages' of deception for maximizing reward without triggering our interpretability tools that are looking for deception.)
I think particularly when the AI is acting in feedback loops with humans, we could get bounced between categories by things like human defectors trying to seize control of transformative AI, human society cooperating and empowering people who aren't defectors, new discoveries made by humans about AI capabilities or alignment, economic shocks, international diplomacy, and maybe even individual coding decisions.
First, I agree with Dmitry.
But it does seem like maybe you could recover a notion of information bottleneck even with out the Bayesian NN model. If you quantize real numbers to N-bit floating point numbers, there's a very real quantity which is "how many more bits do you need to exactly reconstruct X, given Z?" My suspicion is that for a fixed network, this quantity grows linearly with N (and if it's zero at 'actual infinity' for some network despite being nonzero in the limit, maybe we should ignore actual infinity).
But this isn't all that useful; it would be nicer to have an information measure that converges. The divergence seems a bit silly, too, because it seems silly to treat the millionth digit as being as important as the first.
So suppose you don't want to perfectly reconstruct X. Instead, maybe you could say the distribution of X is made of some fixed number of bins or summands, and you want to figure out which one based on Z. Then you get a converging amount of information, and you correctly treat small numbers as less important, but you've had to introduce this somewhat arbitrary set of bins. shrug
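A quick numerical illustration of the contrast, using a toy joint distribution Z = X + Gaussian noise (my own choice, purely for illustration): the bits needed to pin X down to N-bit precision given Z keep growing with N, while the information Z carries about which of a fixed set of bins X falls in converges.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200_000)
z = x + rng.normal(0, 0.1, x.size)   # noisy observation of x

def mutual_information(a_bins, b_bins):
    """Plug-in estimate of I(A; B) in bits from two discrete label arrays."""
    joint, _, _ = np.histogram2d(a_bins, b_bins, bins=[np.unique(a_bins).size,
                                                       np.unique(b_bins).size])
    p = joint / joint.sum()
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (pa @ pb)[nz])).sum())

z_bins = np.digitize(z, np.linspace(z.min(), z.max(), 256))  # fixed, fine-grained Z

# (1) Quantize X ever more finely: H(X_N) - I(X_N; Z) = bits still missing, grows with N.
for n_bits in [2, 4, 6, 8]:
    x_q = np.floor(x * 2 ** n_bits).astype(int)
    px = np.bincount(x_q) / x_q.size
    h_x = -(px[px > 0] * np.log2(px[px > 0])).sum()
    print(n_bits, "bits:", round(h_x - mutual_information(x_q, z_bins), 2), "bits still needed")

# (2) Fixed number of bins for X: the information Z carries about the bin converges.
for k in [2, 4, 8, 16]:
    x_bin = np.digitize(x, np.linspace(0, 1, k + 1)[1:-1])
    print(k, "bins:", round(mutual_information(x_bin, z_bins), 2), "bits about the bin")
```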
A process or machine prepares either |0> or |1> at random, each with 50% probability. Another machine prepares either |+> or |-> based on a coin flip, where |+> = (|0> + |1>)/root2, and |-> = (|0> - |1>)/root2. In your ontology these are actually different machines that produce different states.
I wonder if this can be resolved by treating the randomness of the machines quantum mechanically, rather than having this semi-classical picture where you start with some randomness handed down from God. Suppose these machines use quantum mechanics to do the randomization in the simplest possible way - they have a hidden particle in state |left>+|right> (pretend I normalize), they mechanically measure it (which from the outside will look like getting entangled with it) and if it's on the left they emit their first option (|0> or |+> depending on the machine) and vice versa.
So one system, seen from the outside, goes into the state |L,0>+|R,1>, the other one into the state |L,0>+|R,0>+|L,1>-|R,1>. These have different density matrices. The way you get down to identical density matrices is to say you can't get the hidden information (it's been shot into outer space or something). And then when you assume that and trace out the hidden particle, you get the same representation no matter your philosophical opinion on whether to think of the un-traced state as a bare state or as a density matrix. If on the other hand you had some chance of eventually finding the hidden particle, you'd apply common sense and keep the states or density matrices different.
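Here's a quick numpy check of that bookkeeping (toy code of my own, just restating the algebra above): the two machines' joint states differ, but tracing out the hidden particle gives the same maximally mixed density matrix, and the original semi-classical ensembles agree too.

```python
import numpy as np

ket0 = np.array([1.0, 0.0]); ket1 = np.array([0.0, 1.0])
plus = (ket0 + ket1) / np.sqrt(2); minus = (ket0 - ket1) / np.sqrt(2)
L, R = ket0, ket1  # hidden particle: |left>, |right>

# Machine 1 (seen from outside): |L,0> + |R,1>.  Machine 2: |L,+> + |R,->.
psi1 = (np.kron(L, ket0) + np.kron(R, ket1)) / np.sqrt(2)
psi2 = (np.kron(L, plus) + np.kron(R, minus)) / np.sqrt(2)

def emitted_qubit_density_matrix(psi):
    """Trace out the hidden (first) particle of a two-qubit pure state."""
    full = np.outer(psi, psi).reshape(2, 2, 2, 2)  # axes: hidden, emitted, hidden', emitted'
    return np.einsum('aiaj->ij', full)

print(np.allclose(np.outer(psi1, psi1), np.outer(psi2, psi2)))  # False: different joint states
print(emitted_qubit_density_matrix(psi1))                       # [[0.5 0. ] [0.  0.5]]
print(emitted_qubit_density_matrix(psi2))                       # same maximally mixed state

# The original "classical randomness" ensembles also give identical density matrices:
rho_mix1 = 0.5 * np.outer(ket0, ket0) + 0.5 * np.outer(ket1, ket1)
rho_mix2 = 0.5 * np.outer(plus, plus) + 0.5 * np.outer(minus, minus)
print(np.allclose(rho_mix1, rho_mix2))                          # True
```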
Anyhow, yeah, broadly agree. Like I said, there's a practical use for saying what's "real" when you want to predict future physics. But you don't always have to be doing that.
people who study very "fundamental" quantum phenomena increasingly use a picture with a thermal bath
Maybe talking about the construction of pointer states? That linked paper does it just as you might prefer, putting the Boltzmann distribution into a density matrix. But of course you could rephrase it as a probability distribution over states and the math goes through the same, you've just shifted the vibe from "the Boltzmann distribution is in the territory" to "the Boltzmann distribution is in the map."
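To see "the math goes through the same" concretely, here's a tiny check with a two-level Hamiltonian I made up: the thermal density matrix exp(-βH)/Z equals the probability-weighted mixture over energy eigenstates.

```python
import numpy as np
from scipy.linalg import expm

beta = 1.3
H = np.array([[0.0, 0.4], [0.4, 1.0]])  # arbitrary Hermitian Hamiltonian (my choice)

rho_territory = expm(-beta * H)
rho_territory /= np.trace(rho_territory)  # "the Boltzmann distribution is in the territory"

energies, vecs = np.linalg.eigh(H)
probs = np.exp(-beta * energies); probs /= probs.sum()
rho_map = sum(p * np.outer(v, v) for p, v in zip(probs, vecs.T))  # "...is in the map"

print(np.allclose(rho_territory, rho_map))  # True
```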
Still, as soon as you introduce the notion of measurement, you cannot get away from thermodynamics. Measurement is an inherently information-destroying operation, and iiuc can only be put "into theory" (rather than being an arbitrary add-on that professors tell you about) using the thermodynamic picture with nonunitary operators on density matrices.
Sure, at some level of description it's useful to say that measurement is irreversible, just like at some level of description it's useful to say entropy always increases. Just like with entropy, it can be derived from boundary conditions + reversible dynamics + coarse-graining. Treating measurements as reversible probably has more applications than treating entropy as reversible, somewhere in quantum optics / quantum computing.
Some combination of:
- Interpretability
- Just check if the AI is planning to do bad stuff, by learning how to inspect its internal representations.
- Regularization
- Evolution got humans who like Doritos more than health food, but evolution didn't have gradient descent. Use regularization during training to penalize hidden reasoning.
- Shard / developmental prediction
- Model-free RL will predictably use simple heuristics for the reward signal. If we can predict and maybe control how this happens, this gives us at least a tamer version of inner misalignment.
- Self-modeling
- Make it so that the AI has an accurate model of whether it's going to do bad stuff. Then use this to get the AI not to do it.
- Control
- If inner misalignment is a problem when you use AI's off-distribution and give them unchecked power, then don't do that.
Personally, I think the most impactful will be Regularization, then Interpretability.
The real chad move is to put "TL;DR: See above^" for every section.
When you say there's "no such thing as a state," or "we live in a density matrix," these are statements about ontology: what exists, what's real, etc.
Density matrices use the extra representational power they have over states to encode a probability distribution over states. If we regard the probabilistic nature of measurements as something to be explained, putting the probability distribution directly into the thing we live in is what I mean by "explain with ontology."
Epistemology is about how we know stuff. If we start with a world that does not inherently have a probability distribution attached to it, but obtain a probability distribution from arguments about how we know stuff, that's "explain with epistemology."
In quantum mechanics, this would look like talking about anthropics, or what properties we want a measure to satisfy, or solomonoff induction and coding theory.
What good is it to say things are real or not? One useful application is predicting the character of physical law. If something is real, then we might expect it to interact with other things. I do not expect the probability distribution of a mixed state to interact with other things.
Treating the density matrix as fundamental is bad because you shouldn't explain with ontology that which you can explain with epistemology.