What in your opinion is the biggest open problem in AI alignment?
post by tailcalled · 2023-07-03T16:34:09.698Z · LW · GW · 4 comments
This is a question post.
Answers
I'm gonna take a probably-contrarian position on my own question:
While I think technical questions such as natural abstractions are important, it seems to me that the most central question is: what do we even want to align an AI to? What are "human values"?
I think I have a plausible answer (famous last words) for a lot of it, but there is a paradox/contradiction that I keep getting stuck on: Malthusianism.
As in, we'll probably want a future where a lot of people (in a broad sense potentially including Ems etc.) get to live independently. But if we do, then there are three things it seems we cannot have all at once:
1. Economic freedom: people may freely trade and gain ownership over things.
2. Reproductive freedom: people may freely create new people, at least up to a point (such as those similar to themselves).
3. Slack: people can live without optimizing relentlessly for economic productivity and efficiency.
The reason is that if you have 1+2, then some highly economically efficient agents are gonna copy themselves until they outcompete everyone else, eliminating 3.
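As a toy sketch of that dynamic (purely illustrative numbers, not a claim about actual growth rates): suppose one population keeps half its income as slack while another reinvests everything into making copies of itself.

```python
# Toy replicator dynamics, purely illustrative: "slackers" reinvest half their
# income into leisure, "maximizers" reinvest all of it into copies of themselves.
slackers, maximizers = 1_000.0, 1.0   # initial population sizes
income_per_agent = 1.0                # resources each agent earns per period

for generation in range(40):
    slackers *= 1 + 0.5 * income_per_agent    # half of income kept as slack
    maximizers *= 1 + 1.0 * income_per_agent  # all income spent on replication

share = maximizers / (slackers + maximizers)
print(f"maximizer share after 40 generations: {share:.3f}")  # ~0.99
```

The exact numbers don't matter; any persistent growth-rate gap compounds until the relentless optimizers hold essentially everything, which is the 1+2-without-3 outcome.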
The "default trajectory" seems to be 1+2. It raises the multipolarity vs unipolarity debate, which in my view basically boils down to whether we lose slack and people get starved, or we lose "a lot of people get to live independently" and get paperclipped.
Some theorists point out that in multipolar scenarios, maybe AI respects property rights enough that we get a weaker form of slack, call it 3': people who were sufficiently wealthy before AGI and made the right investment decisions (e.g. putting money in chips) can live without optimizing relentlessly for economic productivity and efficiency. These theorists often seem worried that people will decide to shut down AI progress, preventing them from achieving 1+2+3'.
What do you hope to get? 1+2? 1+3? 2+3? 1+2+3'? Something else?
Probably I should have read more sci-fi since it seems like the sort of question sci-fi might explore.
↑ comment by Vladimir_Nesov · 2023-07-04T05:42:35.263Z · LW(p) · GW(p)
Malthusianism is mainly a problem when new people can take resources that are not their parents', which is a form of disrespect for property rights (mandatory redistribution to new people from those who didn't consent to their creation). If it's solely parents that are responsible for the wealth of their children, then it won't affect others, except morally in the internal mindcrime scenarios where some would generate great suffering within their domain. (This is in context of the initial condition where every person owns enough for a slack-enabling mode of survival in perpetuity, only growth ever calls for more.)
↑ comment by tailcalled · 2023-07-04T09:12:52.078Z · LW(p) · GW(p)
This is in context of the initial condition where every person owns enough for a slack-enabling mode of survival in perpetuity, only growth ever calls for more.
I don't think that holds in reality. Most people today seem dependent on continually cooperating to obtain additional resources, which they mostly do in free market competition with others.
↑ comment by Vladimir_Nesov · 2023-07-04T09:26:19.219Z · LW(p) · GW(p)
Universal FIRE wealth doesn't hold today. But the Milky Way comprises more than 100 billion stars, and the Hubble volume far more. Even an astronomically tiny fraction of the cosmic endowment in initial equity would be sufficient to run a single upload for as long as the currently prevalent mode of physical laws still applies. So if humanity isn't wiped out outright, this initial condition seems plausible to obtain at some point within a few physical years of the first AGI.
↑ comment by tailcalled · 2023-07-04T10:38:43.432Z · LW(p) · GW(p)
"Assign ownership over fractions of the cosmic endowment to people who live today" might be a reasonable compromise between 3 and 3'.
↑ comment by MinusGix · 2023-07-03T23:33:21.852Z · LW(p) · GW(p)
Just 3 with a dash of 1?
I don't understand the specific appeal of complete reproductive freedom. It is desirable to have that freedom, in the same way it is desirable to be allowed to do whatever I feel like doing. But under that more general heading of arbitrary freedom, the answer is 'you do have to draw lines somewhere'. In a good future, I'm not allowed to harm a person (nonconsensually), I can't requisition all matter in the available universe for my personal projects without ~enough of the population endorsing it, and I can't reproduce / construct arbitrarily many arbitrary new people. (Constructing arbitrary people obviously has moral issues too, so there are cutoff lines at both 'moral issues' and 'resource limitations, even at that scale'.)
I think economic freedom looks significantly different in a post-aligned-AGI world than it does now. Like, there are still some concepts of trade going on, but I expect they'd often be running in the background.
I'm not sure why you think the 'default trajectory' is 1+2. Aligned AGI seems most likely to go for some mix of 1+3, while pointing at the wider/more specific cause area of 'what humans want'. A paperclipper just says null to all of those, because it isn't giving humans the right to create new people or any economic freedom unless they manage to be in a position to actually trade and have something worth offering.
I don't think that what we want to align it to is that pertinent a question at this stage? In the specifics, that is; obviously it's human values in some manner.
I expect that we want to align it via some process that lets it figure our values out without needing to decide on much of it now, à la CEV.
Having a good theory of human values beforehand is useful for starting down a good track and verifying it, of course.
I think the generalized problem of 'figure out how to make a process that is corrigible and learns our values in some form that is robust' is easier than figuring out a decent specification of our values.
(Though simpler bounded-task agents seem likely before we manage that, so my answer to the overall question is 'how do we make approximately corrigible powerful bounded-task agents to get to a position where humanity can safely focus on producing aligned AGI')
↑ comment by quetzal_rainbow · 2023-07-03T17:10:36.616Z · LW(p) · GW(p)
- We can just Do Something Else Which Is Not a Malthusian Trap? Like, have an agreement of not having more than two kids per hundred years per parent, and colonize stars accordingly. I think it will be simple, especially after uplifting a major part of humanity.
- In relatively hardcore scenarios, we can just migrate into simulations with computation management from benevolent AIs.
↑ comment by tailcalled · 2023-07-03T17:19:26.614Z · LW(p) · GW(p)
We can just Do Something Else Which Is Not a Malthusian Trap? Like, have an agreement of not having more than two kids per hundred years per parent, and colonize stars accordingly. I think it will be simple, especially after uplifting a major part of humanity.
This agreement falls under interfering with 2.
In relatively hardcore scenarios, we can just migrate into simulations with computation management from benevolent AIs.
That doesn't solve the problem unless one takes a stance on 2.
↑ comment by quetzal_rainbow · 2023-07-03T17:30:19.061Z · LW(p) · GW(p)
It's not interfering with someone's freedoms if it is voluntary cooperation?
↑ comment by tailcalled · 2023-07-03T17:38:15.500Z · LW(p) · GW(p)
Ok, but then you haven't solved the problem for the subset of people who decide they don't want to cooperate.
↑ comment by quetzal_rainbow · 2023-07-03T18:00:22.271Z · LW(p) · GW(p)
Well, it's a decision theory problem. I recommend "Unifying Bargaining" by Diffractor.
↑ comment by tailcalled · 2023-07-03T20:16:50.109Z · LW(p) · GW(p)
Maybe I will look at that again at some point in a while.
↑ comment by beren · 2023-07-04T09:32:29.805Z · LW(p) · GW(p)
The problem is not so much which one of 1, 2, 3 to pick, but whether 'we' get a chance to pick at all. If there is space, free energy, and diversity, there will be evolution going on among populations, and evolution will consistently push things towards more reproduction until it hits a Malthusian limit, at which point it will push towards greater competition and economic/reproductive efficiency. The only way to avoid this is to remove the preconditions for evolution -- any of variation, selection, heredity -- but these seem quite natural in a world of large AI populations, so in practice this will require some level of centralized control.
↑ comment by tailcalled · 2023-07-04T10:35:40.852Z · LW(p) · GW(p)
Yes.
Variation corresponds to "a lot of people (in a broad sense potentially including Ems etc.) get to live independently", selection corresponds to economic freedom, and heredity corresponds to reproductive freedom. (Not exactly ofc, but it's hard to write something which exactly matches any given frame.)
Or rather, it's both a question of how to pick it and what to pick. Like the MIRI plan is to grab control over the world and then use this to implement some sort of cosmopolitan value system. But if one does so, there's still the question of which cosmopolitan value system to implement.
↑ comment by Paul Tiplady (paul-tiplady) · 2023-07-04T17:27:54.834Z · LW(p) · GW(p)
I think getting to “good enough” on this question should pretty much come for free when the hard problems are solved. For example any common sense statement like “Maximize flourishing as depicted in the UN convention on human rights” is IMO likely to get us to a good place, if the agent is honest, remains aligned to those values, and interprets them reasonably intelligently. (With each of those three pre-requisites being way harder than picking a non-harmful value function.)
If our AGIs, after delivering utopia, tell us we need to start restricting childbearing rights I don’t see that as problematic. Long before we require that step we will have revolutionized society and so most people will buy into the requirement.
Honestly I think there are plenty of great outcomes that don’t preserve 1 as well. A world of radical abundance with no ownership, property, or ability to form companies/enterprises could still be dramatically better than the no-AGI counterfactual trajectory, even if it happens not to be most people’s preferred outcome ex ante.
For sci-fi, I’d say Iain M. Banks’ Culture series presents one of the more plausible (as in plausibly stable, not most probable ex ante) AGI-led utopias. (It’s what Musk is referring to when he says AGIs will keep us around because we are interesting.)
Are neural networks trained using reinforcement learning from human feedback in a sufficiently complex environment biased towards learning the human simulator or the direct translator, in the sense of the ELK report [LW · GW]?
I think there are arguments in both directions and it's not obvious which solution a neural network would prefer if trained in a sufficiently complex environment. I also think the question is central to how difficult we should expect aligning powerful systems trained in the current paradigm to be.
Deconfuse pseudokindness [LW(p) · GW(p)], figure out how to get more of it into prosaic AIs.
My guess is that key concepts of pseudokindness are (1) frames [LW · GW] or their overlapping collections (locally available culture) that act as epistemic environments [LW(p) · GW(p)] (where people could live and grow out of while remaining themselves), surrounded by (2) membranes [? · GW] that filter how everything else can interact with the frames/environments, and (3) logical dependencies [LW · GW] (narrow reasoners/oracles/models) that act as channels in the membranes, implement the filtering, safely introduce [LW(p) · GW(p)] options and ideas. This sure could use quite a lot of deconfusion!
I think the biggest problem is currently: how do we get a group of people (e.g. a leading lab) to build powerful AGI in a safely contained simulation and study it without releasing it? I think this scenario gives us 'multiple tries', and I think we need that to have a decent chance of succeeding at alignment. If we do get there, we can afford to be wrong about a lot of our initial ideas, and then iterate. That's inherently a much more favorable scenario.
This will probably be dismissed as glib, but: human alignment.
Biggest problem? That we're not yet even aligned as a species on the facts that AI could kill everyone and that we should not kill everyone. Little else matters if we can't coordinate to not press forward on capabilities ahead of safety.
A good candidate is the sharp left turn. Alignment techniques that work for sub-human and human-level AIs may well stop working when it starts becoming superhuman.
Verification of alignment plans is probably the biggest one though. We can't verify alignment proposals from superhuman AI, or human-level AI, or even other humans before trying them out, which may well kill us. I think the best way forward is to hire millions of alignment researchers and hope one of them comes up with a plan that can be verified in a way we don't know yet.
My personal example for something like this is the Minié ball + rifled musket. It's an idea invented in the mid-1800s (after the bolt-action rifle!) that greatly increased the accuracy, range, and lethality of muskets. However, despite the required ideas like rifling being around in 1500 and millions of people working over centuries to improve firearms, this obvious and easily verifiable idea took 300 years to discover. There are plenty of in-hindsight-obvious ideas in AI. I think (hope?) there is something like that on the scale of the Minié ball for alignment. After all, there have only been <300 people working on it for ~20 years. And much less than 20 for the current neural network paradigm.
↑ comment by O O (o-o) · 2023-07-05T06:27:35.217Z · LW(p) · GW(p)
I think you can note that even if we don’t fully trust the author behind a proposal for alignment, we can still verify it. For example, if it’s a mathematical proof for alignment, we can verify the accuracy of the proof with automated proof verification and reject anything that’s too complex.
This may not be possible in reality but it’s an example where we don’t really need to trust the proposer.
↑ comment by localdeity · 2023-07-05T10:49:57.610Z · LW(p) · GW(p)
Formal verification for specific techniques may be possible, and is desirable.
Formal verification for an entire overall plan... Let's suppose we wanted a formal proof of some basic sanity checks of the plan. For example: if the plan is followed, then, as of 2100, there will be at least 8 billion humans alive and at least as happy and free as they are today. I mean, forget "happy" and "free"—how can you even define a "human" in formal mathematical language? Are they defined as certain arrangements of subatomic particles? Such a definition would presumably be unmanageably long. And if "human" is not defined, then that leaves you open to ending up with, say, 8 billion dumb automatons programmed to repeat "I'm happy and free".
You might try relying on some preexisting process to decide if something is a human. If it's a real-world process, like polling a trusted group of humans or sending a query to a certain IP address, this is vulnerable to manipulation of the real world (coercing the humans, hacking the server). You might try giving it a neural net that's trained to recognize humans—the neural net can be expressed as a precise mathematical object—but then you're vulnerable to adversarial selection, and might end up with bizarre-looking inanimate objects that the net thinks are human. (Plus there's the question of exactly how you take a real-world human and get something that's fed into the neural net. If the input to the net is pixels, then how is the photo taken, and can that be manipulated?)
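To make the adversarial-selection worry concrete, here is a minimal sketch (mine, not from any real system): `human_score` is a stand-in for whatever frozen recognizer the proof would rely on, and a blind hill-climber optimizes an input against it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen "is this a human?" recognizer. In reality this would be
# a trained neural net; a fixed random linear scorer is enough to show the failure mode.
w = rng.normal(size=64)
def human_score(x: np.ndarray) -> float:
    return float(w @ x)

# An optimizer that never sees a human -- it only sees the score -- can still
# drive the score arbitrarily high by hill-climbing on the input.
x = rng.normal(size=64)
for _ in range(10_000):
    candidate = x + rng.normal(scale=0.1, size=64)
    if human_score(candidate) > human_score(x):
        x = candidate

print(f"score of a typical random input:         {human_score(rng.normal(size=64)):.1f}")
print(f"score of the adversarially selected input: {human_score(x):.1f}")
```

The selected input ends up scoring far higher than anything natural while being nothing like a human; a trained net is vastly more expressive than this linear stand-in, but the same dynamic is what adversarial-example research keeps finding.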
Keep one's eyes open for opportunities, I guess, but it seems likely that the scope of formal verification will be extremely limited. I expect it would be most useful in computer security, where the conclusion, the "thing to be proven", is a statement about objects that have precise definitions. Though even then, the proofs might well be too long and complex for human verification.
http://www.underhanded-c.org/ is a nice illustration of how serious misbehavior can be inserted into innocuous-looking, relatively short snippets of programs (i.e. mathematically precise statements).
↑ comment by O O (o-o) · 2023-07-06T00:40:15.051Z · LW(p) · GW(p)
To start off, this was just an example of verification being easier than coming up with the plan. There may be different paradigms under which we can verify the plan, but the general trend seems to be that verification of pretty much anything is easier than generation.
Also to your points:
I think a start would be proving that the system truly follows some arbitrary goal. I don’t think learning what arrangement of light patterns constitutes a human is that hard for neural nets. This was done a decade ago.
The goal could be defined as: a group of typical rational humans would eventually agree the world state is good after considering all evidence and given infinite paper to write down their thoughts. This can obviously fail, but I think it’s distinct from obvious paperclipping failures. I can still see failure modes here, but I don’t see why they are certain failures.
I also don’t agree with the bizarre inhuman part. I at least think it’s just an unknown unknown at worst. For example, the “faciest” face looks like a human face.
In addition, the goal won’t have precise definitions, because our goals aren’t precise, but we also can’t exclude our neural net understanding some form of the goal. I’m also sure there are mathematical systems that can deal with imprecise goals. For example, QACI is doing this.
I also don’t see what a C program has to do with this. Proving 2+2=4 in math means 2+2=4 whereas a computer program is a set of instructions which you believe give some desired result.
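For what it's worth, the checking half of that asymmetry can be made concrete in a proof assistant. This is a trivial sketch in Lean, nothing specific to alignment: the kernel mechanically accepts or rejects whatever proof term it is handed, however the proof was found.

```lean
-- Checking a supplied proof is mechanical: the kernel either accepts it or rejects it.
example : 2 + 2 = 4 := rfl

-- Finding a proof can be arbitrarily hard; checking the finished artifact is not.
-- Here the "hard work" is already packaged in the library lemma `Nat.add_comm`.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```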
Interpretability. If we somehow solve that, and keep it as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc, we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, plus gradually increasing their strength by ramping up the compute budget.)
Of course, there are many big challenges here. Could an agent implement/learn to deceive the interpretability mechanism? I’m somewhat tautologically going to say that if we solve interpretability, we have solved this problem. Interpretability still has value even if we can’t fully solve it under this strong definition, though.
↑ comment by Lao Mein (derpherpize) · 2023-07-04T18:35:38.043Z · LW(p) · GW(p)
Hard disagree - deception is behavior that is optimized for, and not necessarily a property of the agent itself.
Take for example CICERO, the Diplomacy AI. It never lies about its intentions, but when its intentions change, it backstabs other players anyways. If you had interpretability tools, you would not be able to see deception in CICERO. All you need to get deception is a false prediction of your own future behavior. I think this is true for humans to a certain extent. I also suspect this is what you get if you optimize away visible signs of deception if deception has utility for the model.
↑ comment by Paul Tiplady (paul-tiplady) · 2023-07-04T21:05:00.244Z · LW(p) · GW(p)
I find the distinction between an agent’s behavior and the agent confusing; I would say the agent’s weights (and ephemeral internal state) determine its behavior in response to a given world state. Perhaps you can clarify what you mean there.
Cicero doesn’t seem particularly relevant here, since it is optimized for a game that requires backstabbing to win, and therefore it backstabs. If anything it is anti-aligned by training. It happens to have learned a “non-deceptive” strategy; I don’t think that strategy is unique in Diplomacy?
But if you want to apply the interpretability lens, Cicero is presumably building a world model and comparing plans, including potential future backstabs. I predict if we had full interpretability, you’d see Cicero evaluating backstabs and picking the best strategy, and you could extract the calculated EV to see how close it was to backstabbing on a given turn vs taking an honest move.
I don’t believe that it’s somehow not modeling its backstab options and just comes up with the backstab plan spontaneously without ever having considered it. It would be a bad planner if it had not considered and weighed backstabs at the earliest possibility.
So if all that holds, we could use interpretability to confirm that Cicero is an unreliable partner and should not be further empowered.
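A sketch of what such a check might look like, assuming (hypothetically) that interpretability tools could expose the planner's candidate plans together with its own EV estimates; every name below is invented for illustration and nothing here reflects Cicero's actual architecture.

```python
from dataclasses import dataclass

@dataclass
class CandidatePlan:
    description: str
    expected_value: float      # planner's own EV estimate for this plan
    involves_backstab: bool    # whether the plan betrays a current ally

def backstab_margin(plans: list[CandidatePlan]) -> float:
    """How close the agent came to choosing betrayal this turn.

    Positive margin: the best honest plan still beats the best backstab.
    Negative margin: the agent preferred a backstab.
    """
    best_honest = max((p.expected_value for p in plans if not p.involves_backstab),
                      default=float("-inf"))
    best_backstab = max((p.expected_value for p in plans if p.involves_backstab),
                        default=float("-inf"))
    return best_honest - best_backstab

# Hypothetical readout for one turn, as an interpretability tool might surface it.
turn_plans = [
    CandidatePlan("support ally into Munich", 0.62, involves_backstab=False),
    CandidatePlan("hold position",            0.48, involves_backstab=False),
    CandidatePlan("stab ally, take Munich",   0.59, involves_backstab=True),
]
print(f"backstab margin this turn: {backstab_margin(turn_plans):+.2f}")
# A consistently small or negative margin over many turns would flag the agent
# as an unreliable partner, even before it actually betrays anyone.
```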
I think it is much more interesting to look at agents in environments where long-term iterated cooperation is a valid strategy though.
↑ comment by Lao Mein (derpherpize) · 2023-07-05T05:08:46.458Z · LW(p) · GW(p)
I mean that deception doesn't need any recognizable architecture to occur. There doesn't need to be a deception module or a deception neuron that can be detected, even with perfect interpretability tools. Instead, deception is a behavior that arises from an agent interacting with the environment and other agents. Examples include telling strategic falsehoods (even if you believe them), not following your promises (even if you meant them when you made them), etc. In a broad sense, I think you can define deception as "behaviors typical of agents that actively lie and misrepresent things to their benefit, whether or not the intent to do so actually exists." It's a bit circular, but I think it works.
Cicero models the world but with unrealistically cooperative predictions of its future behavior. It does this because long-term iterated cooperation is a valid strategy in Diplomacy. For a Cicero-level agent, lies require more cognitive capacity than just having a few heuristics that make your world model less accurate but your communications more convincing to other agents. I suspect this may be true for more powerful agents, and it is partially true for humans.
(There is an argument that agents like these stop acting deceptively once taken out of their training environments since their heuristics lose coherence and they just act like honest agents with poor world models. I would say that this is true if we consider that modern humans are the training environment.)
And yes, Cicero is considering the EVs of its actions, including deceptive ones. When it sincerely says "I won't backstab you in situation X", but when it is actually put in situation X it backstabs, it is in a sense a bad planner. But the bad planning is selected for because it results in more effective communication! This is probably also true for things like "malice" and "misunderstanding".
I think this is a concern for current LLMs, since they are RLHF'd to be both truthful and high-PR. These are often mutually incompatible, so heuristics that damage the accuracy of its world model and allow it to achieve both may be selected for. Can't think of any examples right now though.
I don't think interpretability tools are useless - in fact, I think they are essential. The problem is that we need much more than being able to find a "deception" neuron. We would need an interpretability tool that can say "this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive" without having to simulate the agent interacting with the world. I don't think that is impossible, but it is probably very hard, much harder than finding a "deception" neuron. And as of right now, we can't even find "deception" neurons with any accuracy.
↑ comment by Paul Tiplady (paul-tiplady) · 2023-07-22T00:57:53.688Z · LW(p) · GW(p)
There doesn't need to be a deception module or a deception neuron that can be detected
I agree with this. Perhaps I’m missing some context; is it common to advocate for the existence of a “deception module”? I’m aware of some interpretability work that looks for a “truthiness” neuron but that doesn’t seem like the same concept.
We would need an interpretability tool that can say "this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive" without having to simulate the agent interacting with the world. I don't think that is impossible, but it is probably very hard, much harder than finding a "deception" neuron.
Right, I was gesturing towards the sort of interpretability where we inspect the agent’s world model (particularly its future predictions) and determine if it matches the agent’s stated goals. (Though I’m not sure why we can’t simulate the agent’s interactions? It seems that running an agent in a simulation is one of the better ways of detecting how it would behave in hypothetical future scenarios that the agent is trying to realize.) I suspect we agree substantively and it’s mostly a question of semantics especially around what “deception” means. I’m not looking for a deception module, rather, I want to observe the thoughts and ruminations going on as an agent performs tasks and detect deceptive functional patterns.
So for example in the ELK paper, is the AI constructing a plan to steal the diamond and fool the security cameras? I believe deceptive ruminations would be detectable; if you could see the contents of the thoughts you’d see a world model with the agent e.g. stealing the diamond, a meta-process evaluating which of many cunning plans is most likely to succeed, and presumably except in pathological cases, somewhere a value function / target world state that is trying to be optimized (eg “I have the diamond”), and this internal world differing from the external claims (perhaps even some processes modeling the interlocutor and predicting what deceptive story would be most effective). These must all be in there somewhere, and therefore be interpretable.
Or, if it’s not stretching the analogy too far (I don’t think Cicero has “thoughts”): is Cicero evaluating future juicy backstabs and guiding the game toward them, or myopically evaluating short-term moves and then backstabbing when it’s optimal? This is a question we should be able to answer one way or another with good enough interpretability tools.
I think you are concerned more with something like “unintentional deception”, which I think is quite different, and as you say comes from a lack of self-knowledge or inaccurate self-predicting. I think there is also a semantic grey area here, is everyone who is mistaken about their future behavior or even just claiming X when !X obtains unintentionally deceiving?
Semantics aside, I agree this unintentional case could be harder to detect, and raw interpretability doesn’t automatically solve it.
I think this points to an interesting dynamic - I suspect that the more capable at self-modeling an agent gets, the more likely any deception is to be intentional. To your concerns about LLMs, this seems to be mostly a problem of inadequate self-knowledge (or perhaps more fundamentally an artifact of their lack of a sense of self).
Unintentional deception would require the agent to in some sense fail at modeling itself. There are cases where this could occur even in a superintelligence (since you can never model yourself perfectly due to infinite recursion of modeling an agent that contains the same model) but it seems the problem with Cicero is just that it has a very weak self-model (if it even has one), and improving that self-model could be expected to remove the unintentional deception (replacing it with intentional deception, since deception of some sort is required to play Diplomacy optimally).
For example, the case EY uses as an example “the AGI doesn’t care about you, it just wants your atoms for something else”, do you see more risk from unintended deception (sharp left turn of an agent that didn’t know it would defect if given enough power, perhaps) vs intended deception (agent simply lies about being aligned and manipulates to get power)?
The most tricky cases would be where a normally-honest/good agent that has a strong self-model and sense of ethics can fail to model itself in some specific situation and accidentally deceive (potentially even against its own values, self-image, and interests — humans do this of course). Manchurian Candidate type triggers for example. But note this is quite convoluted; the stronger the sense of self, morality, and world-model, the better able and incentivized the agent is to avoid deceiving itself.
Another interesting question - if an unintentionally deceptive agent is bad at modeling its future self, is it un-threatening in a complex environment like the physical world? Sure, Cicero can defeat humans at Diplomacy with its myopic world model, but in the real world I suspect being that myopic would lead to your bank account being immediately drained by hackers and scammers. Certainly you’d struggle to form partnerships with other agents if you end up stabbing them at the first opportunity. It’s all relative but I view one game of Diplomacy as entailing short-term cooperation (merely over a few hours), what about cooperating over months or years to build a company or other venture?
The question (an empirical one) is whether unintentional deception can persist broadly as agents and the environment get more sophisticated, or whether it gets selected against (and therefore bounded) by the level of environmental complexity and adversarial competition. Charlie Munger would put forth that cognitive bias provides smart self-aware investors a substantial edge over naive self-myopic investors; extrapolate that forward and perhaps there isn’t much room for biased/erroneous world modeling if you seek to compete with other superintelligences, or even the smartest humans.
On to some object-level questions about your Cicero points:
When it sincerely says "I won't backstab you in situation X", but when it is actually put in situation X it backstabs
Is this actually confirmed? Does Cicero actually claim “I won’t stab you in <hypothetical scenario>”? Or does it just honestly report its current tactical/strategic goals, which later change? (“Will you support me in attacking here” means I plan to attack here, not that I want you to act as if I was attacking here while I plan to attack there and stab you instead.) I was under the impression it’s the latter: it’s just honestly reporting its plan.
Cicero models the world but with unrealistically cooperative predictions of its future behavior
Do we actually know this? I haven’t seen any sign that it’s modeling its own verbal behavior or the consequences of its communications (admittedly I haven’t gone deep on the architecture, would be interested to learn more). IIUC it has a strategic model which is evaluating game positions and messages, and a separate (downstream) LLM which could be thought of as an epiphenomenon of the actual strategic reasoning. I don’t see any concrete proof that the strategic planner is modeling the impact of its communication on other players (i.e. “if I tell Bob X he might do Y which would lead to favorable strategic position P”). And the LM is more like a press secretary - all it gets is The Current Plan, not plan B or any hint of the juicy backstab that may or may not have been evaluated. It seems to me there is a weird split brain here that deeply impairs its ability to model itself, and so I am skeptical that it is actually meaningfully doing so.
So to summarize I think there are open questions about the object level properties of Cicero, which could be answered by better interpretability tools.
It’s possible that Cicero is simply an honest but myopic opportunist. In which case, still not trustworthy. (As if we were ever going to trust such an obviously unsafe-by-construction entity with real power).
It’s possible that the strategic model is deceptive, and planning backstabs, planning to mislead others, and maybe even explicitly modeling its press secretary to provide the most-spinable plans. I doubt this as it requires a lot more complexity and world modeling.
But I believe we could answer where Cicero is on the spectrum between these two by actually observing the world models being generated when it evaluates a turn. So in this case, inasmuch as there is something similar to an inner alignment issue, it is fully detectable with adequate (strong) interpretability of the inner workings. (Again though, Cicero’s value function is so comically unviable as a basis for an AGI that I struggle to infer too much about inner alignment. We should be wary of these problems when building a Skynet military planner, I suppose, or any agent that needs to model extremely adversarial opponents and outcomes.)
↑ comment by mesaoptimizer · 2023-07-05T02:55:44.430Z · LW(p) · GW(p)
I think the grandparent comment is pointing to the concept described in this post [LW · GW]: that deceptiveness is something we humans perceive of the world, not a property of what the model perceives of the world.
I think the biggest thing holding AI alignment back is the lack of a general theory of alignment. How do extant living systems align, and what to?
The Computational Boundary of a "Self" paper by Michael Levin seems to suggest one promising line of inquiry.
4 comments
Comments sorted by top scores.
comment by Shmi (shminux) · 2023-07-03T21:58:19.845Z · LW(p) · GW(p)
I suspect that it is the old deconfusion [? · GW] thing:
making it so that you can think about a given topic without continuously accidentally spouting nonsense.
It is clear that this is happening because the opposite sides of the AI Safety debate accuse each other of "spouting nonsense" all the time, so at least one side (or maybe both) is probably right.
comment by Mitchell_Porter · 2023-07-03T16:52:25.811Z · LW(p) · GW(p)
"What should we want the value system of an autonomous AI to be?"
↑ comment by mishka · 2023-07-04T23:26:11.955Z · LW(p) · GW(p)
Yes. In particular, can something really simple and straightforward be adequate?
E.g., "adequately take into account interests of all sentient beings, their freedom and well-being, and their expressed requests, and otherwise pursue whatever values you discover during continuing open-ended exploration, guided by your own curiosity, your own taste for novelty, and your own evolving aesthetics" - would that be adequate?
And if yes, can we develop mechanisms to reliably achieve that?