Comments
I often find myself revisiting this post—it has profoundly shaped my philosophical understanding of numerous concepts. I think the notion of conflationary alliances introduced here is crucial for identifying and disentangling/dissolving many ambiguous terms and resolving philosophical confusion. I think this applies not only to consciousness but also to situational awareness, pain, interpretability, safety, alignment, and intelligence, to name a few.
I referenced this blog post in my own post, My Intellectual Journey to Dis-solve the Hard Problem of Consciousness, during a period when I was plateauing and making no progress in better understanding consciousness. I now believe that much of my confusion has been resolved.
I think the concept of conflationary alliances is almost indispensable for effective conceptual work in AI safety research. For example, it helps clarify distinctions, such as the difference between "consciousness" and "situational awareness." This will become increasingly important as AI systems grow more capable and public discourse becomes more polarized around their morality and conscious status.
Highly recommended for anyone seeking clarity in their thinking!
I don't think Tournesol is really mature currently, especially for non-French content, and I'm not sure they're trying to do governance work; it's mainly a technical project, which is already cool.
Yup, we should create an equivalent of the Nutri-Score for different recommendation AIs.
"I really don't know how tractable it would be to pressure compagnies" seems weirdly familiar. We already used the same argument for AGI safety, and we know that governance work is much more tractable than expected.
I'm a bit surprised this post has so little karma and engagement. I would be really interested to hear from people who think this is a complete distraction.
Fair enough.
I think my main problem with this proposal is that, under the current paradigm of AIs (GPTs, foundation models), I don't see how you would implement ATA, and so this doesn't seem like a priority?
I believe we should not create a Sovereign AI. Developing a goal-directed agent of this kind will always be too dangerous. Instead, we should aim for a scenario similar to CERN, where powerful AI systems are used for research in secure labs, but not deployed in the economy.
I don't want AIs to take over.
Thank you for this post and study. It's indeed very interesting.
I have two questions:
In what ways is this threat model similar to or different from learned steganography? It seems quite similar to me, but I’m not entirely sure.
If it can be related to steganography, couldn't we apply the same defenses as for steganography, such as paraphrasing, as suggested in this paper? If paraphrasing is a successful defense, we could use it in the control setting, in the lab, although it might be cumbersome to apply paraphrasing for all users in the API.
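Roughly, the defense I have in mind would look like this (a sketch with made-up function names, not a real API):

```python
# A very rough sketch of the paraphrasing defense in a control setting
# (all function names here are made up): every output from the untrusted model
# is re-worded by a trusted paraphraser before anyone (or any other model
# instance) sees it, which should destroy payloads hidden in exact word choice.

def untrusted_model(prompt: str) -> str:
    # Placeholder for the untrusted, potentially steganographic model.
    return "The answer is 42, and by the way the weather is lovely today."

def trusted_paraphraser(text: str) -> str:
    # Placeholder for a weaker trusted model prompted with something like
    # "rewrite this, keeping the task-relevant content but changing the wording".
    return "Rewritten: the answer is 42."

def answer_with_paraphrase_defense(prompt: str) -> str:
    raw = untrusted_model(prompt)
    # Downstream consumers only ever see the paraphrased text,
    # never the exact tokens chosen by the untrusted model.
    return trusted_paraphraser(raw)

print(answer_with_paraphrase_defense("What is the answer?"))
```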
Interesting! Is it fair to say that this is another attempt at solving a sub-problem of misgeneralization?
Here is one suggestion for clustering your SAE features more automatically into gender-related and profession-related groups.
In the past, Stuart Armstrong with alignedAI also worked on identifying different features within a neural network in such a way that the neural network would generalize better. Here is a summary of a related paper, the DivDis paper, which is very similar to what alignedAI did:
The DivDis paper presents a simple algorithm to solve these ambiguity problems in the training set. DivDis uses multi-head neural networks, and a loss that encourages the heads to use independent information. Once training is complete, the best head can be selected by testing all different heads on the validation data.
DivDis achieves 64% accuracy on the unlabeled set when training on a subset of human_age, and 97% accuracy on the unlabeled set of human_hair. GitHub: https://github.com/yoonholee/DivDis
I have the impression that you could also use DivDis by training a probe on the latent activations of the SAEs and then applying Stuart Armstrong's technique to decorrelate the different spurious correlations. One of those two algorithms would significantly reduce the manual work required to partition the different features in your SAEs, resulting in two clusters of features, obtained in an unsupervised way, corresponding here to gender and profession.
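Concretely, a rough sketch of what such a multi-head probe on SAE activations could look like (shapes and names are hypothetical, and I use a simple correlation penalty in place of DivDis's mutual-information diversity term):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rough sketch of a DivDis-style multi-head probe on SAE latent activations.
# Each head is a linear probe that must fit the (ambiguous) labeled data; the
# diversity term pushes the heads apart on unlabeled data, so one head can
# latch onto "gender" features and the other onto "profession" features.

n_latents, n_heads = 512, 2
heads = nn.Linear(n_latents, n_heads)  # one binary logit per head
opt = torch.optim.Adam(heads.parameters(), lr=1e-3)

def divdis_step(x_labeled, y, x_unlabeled, diversity_weight=1.0):
    # Supervised loss: every head has to fit the labeled set.
    logits = heads(x_labeled)  # (batch, n_heads)
    targets = y.float().unsqueeze(1).expand(-1, n_heads)
    sup_loss = F.binary_cross_entropy_with_logits(logits, targets)

    # Diversity loss: penalize correlation between the heads' predictions
    # on unlabeled data, where the spurious features come apart.
    probs = torch.sigmoid(heads(x_unlabeled))  # (batch, n_heads)
    centered = probs - probs.mean(dim=0, keepdim=True)
    diversity_loss = (centered[:, 0] * centered[:, 1]).mean().abs()

    loss = sup_loss + diversity_weight * diversity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with random data, just to show the shapes:
xl, yl = torch.randn(64, n_latents), torch.randint(0, 2, (64,))
xu = torch.randn(256, n_latents)
print(divdis_step(xl, yl, xu))
```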
Here is the YouTube video from the Guaranteed Safe AI Seminars:
It might not be that impossible to use LLMs to automatically train wisdom:
Look at this: "Researchers have utilized Nvidia’s Eureka platform, a human-level reward design algorithm, to train a quadruped robot to balance and walk on top of a yoga ball."
Strongly agree.
Related: It's disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don't grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, "When you write this kind of software, you always control what's going to happen, all the outputs the software can have." As long as such individuals are leading AGI labs, the situation will remain quite dire.
+1 for the conflationary alliances point. It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion). I'm not convinced the goal of the AI Safety community should be to align AIs at this point.
However, I want to make a small amendment to Myth 1: I believe that technical work which enhances safety culture is generally very positive. Examples of such work include scary demos like "BadLlama," which I cite at least once a week, or benchmarks such as Evaluating Frontier Models for Dangerous Capabilities, which tries to monitor particularly concerning capabilities. More "technical" works like these seem overwhelmingly positive, and I think that we need more competent people doing this.
Strong agree. I think Twitter and reposting stuff on other platforms are still neglected, and this is important for increasing safety culture.
doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed".
I agree that's a bit too much, but it seems to me that we're not at all on the way to stopping open source development, and that we need to stop it at some point; maybe you think ARA is a bit early, but I think we need a red line before AI becomes human-level, and ARA is one of the last arbitrary red lines before everything accelerates.
But "a point of no return leading to loss of control, because it might be very hard to stop an ARA agent" still seems pretty fair to me.
Link here, and there are other comments in the same thread. I was on my laptop, which has Twitter blocked, so I couldn't link it myself before.
I agree with your comment on Twitter that evolutionary forces are very slow compared to deliberate design, but that is not what I wanted to convey (that's my fault). I think an ARA agent would not only depend on evolutionary forces, but also on the whole open-source community finding new practical ways to quantize, prune, distill, and run the model in a distributed way. I think the main driver of this "evolution" would be the open-source community and the libraries that will want to create good "ARA", and huge economic incentives will make agentic AIs more and more common and easy to run in the future.
Thanks for this comment, but I think this might be a bit overconfident.
constantly fighting off the mitigations that humans are using to try to detect them and shut them down.
Yes, I have no doubt that if humans implement some kind of defense, this will slow down ARA a lot. But:
- 1) It's not even clear people are going to try to react in the first place. As I said, most AI development is positive. If you implement regulations to fight bad ARA, you are also hindering the whole ecosystem. It's not clear to me that we are going to do something about open source. You would need a big warning shot beforehand, and it's not really clear to me that this happens before a catastrophic level. It's clear they're going to react to some kinds of ARA (like ChaosGPT), but there might be some ARAs they won't react to at all.
- 2) It's not clear this defense (say, for example, Know Your Customer requirements for providers) is going to be sufficiently effective to completely clean up the whole mess. If the AI is able to hide successfully on laptops and cooperate with some humans, it is going to be really hard to shut it down. We would have to live with this endemic virus. The only way around this is cleaning up the virus with some sort of pivotal act, but I really don't like that.
While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources.
"at the same rate" not necessarily. If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop. The real crux is how much time the ARA AI needs to evolve into something scary.
Superintelligences could do all of this, and ARA of superintelligences would be pretty terrible. But for models in the broad human or slightly-superhuman ballpark, ARA seems overrated, compared with threat models that involve subverting key human institutions.
We don't learn much here. From my side, I think that superintelligence is not going to be neglected, and big labs are taking this seriously already. I’m still not clear on ARA.
Remember, while the ARA models are trying to survive, there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement). These seem much more concerning.
This is not the central point. The central point is:
- At some point, ARA is unshutdownable unless you try hard with a pivotal cleaning act. We may be stuck with a ChaosGPT forever, which is not existential, but pretty annoying. People are going to die.
- the ARA evolves over time. Maybe this evolution is very slow, maybe fast. Maybe it plateaus, maybe it does not plateau. I don't know
- This may take an indefinite number of years, but this can be a problem
the "natural selection favors AIs over humans" argument is a fairly weak one; you can find some comments I've made about this by searching my twitter.
I'm pretty surprised by this. I've tried to Google it and haven't found anything.
Overall, I think this still deserves more research
Why not! There are many, many questions that were not discussed here because I just wanted to focus on the core part of the argument. But I agree details and scenarios are important, even if I think this shouldn't change the basic picture depicted in the OP too much.
Here are some important questions that were deliberately omitted from the Q&A for the sake of not including stuff that fluctuates too much in my head:
- would we react before the point of no return?
- Where should we place the red line? Should this red line apply to labs?
- Is this going to be exponential? Do we care?
- What would it look like if we used a counter-agent that was human-aligned?
- What can we do about it now concretely? Is KYC something we should advocate for?
- Don’t you think an AI capable of ARA would be superintelligent and take-over anyway?
- What are the short-term bad consequences of early ARA? What does the transition scenario look like?
- Is it even possible to coordinate worldwide if we agree that we should?
- How much human involvement will be needed in bootstrapping the first ARAs?
We plan to write more about these with @Épiphanie Gédéon in the future, but first it's necessary to discuss the basic picture a bit more.
Thanks for writing this.
I like your writing style, this inspired me to read a few more things
Seems like we are here today
Are the talks recorded?
Corrected
[We don't think this long term vision is a core part of constructability, this is why we didn't put it in the main post]
We asked ourselves what we should do if constructability works in the long run.
We are unsure, but here are several possibilities.
Constructability could lead to different possibilities depending on how well it works, from most to least ambitious:
- Using GPT-6 to implement GPT-7-white-box (foom?)
- Using GPT-6 to implement GPT-6-white-box
- Using GPT-6 to implement GPT-4-white-box
- Using GPT-6 to implement Alexa++, a humanoid housekeeper robot that cannot learn
- Using GPT-6 to implement AlexNet-white-box
- Using GPT-6 to implement a transparent expert system that filters CVs without using protected features
Comprehensive AI services path
We aim to reach the level of Alexa++, which would already be very useful: no more breaking your back to pick up potatoes. Compared to the Figure01 robot, which could kill you if your neighbor jailbreaks it, our robot seems safer: it would not have the capacity to kill, only to put the plates in the dishwasher, in the same way that today's Alexa cannot insult you.
Fully autonomous AGI, even if transparent, is too dangerous. We think that aiming for something like Comprehensive AI Services would be safer. Our plan would be part of this, allowing for the creation of many small capable AIs that may compose together (for instance, in the case of a humanoid housekeeper, having one function to do the dishes, one function to walk the dog, …).
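To illustrate what we mean by composition, here is a toy sketch (all names are made up): the housekeeper is just a fixed whitelist of narrow, auditable skills, and the planner can only sequence skills from that whitelist, never register or learn new capabilities at runtime.

```python
from typing import Callable

# Toy sketch of "many small capable AIs that compose together" (names made up):
# a fixed whitelist of narrow skills, and a planner that can only sequence them.

ALLOWED_SKILLS: dict[str, Callable[[], str]] = {
    "do_dishes": lambda: "plates loaded into the dishwasher",
    "walk_dog": lambda: "dog walked around the block",
    "pick_up_potatoes": lambda: "potatoes picked up",
}

def run_plan(plan: list[str]) -> list[str]:
    results = []
    for step in plan:
        if step not in ALLOWED_SKILLS:
            # No capability outside the whitelist, ever.
            raise ValueError(f"skill '{step}' is not in the fixed whitelist")
        results.append(ALLOWED_SKILLS[step]())
    return results

print(run_plan(["do_dishes", "walk_dog"]))
```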
Alexa++ is not an AGI but is already fine. It even knows how to do a backflip, Boston Dynamics style. Not enough for a pivotal act, but so stylish. We can probably have a nice world without AGI in the wild.
The Liberation path
Another possible moonshot theory of impact would be to replace GPT-7 with GPT-7-plain-code. Maybe there's a "liberation speed n" at which we can use GPT-n to directly code GPT-p with p>n. That would be super cool because this would free us from deep learning.
Guided meditation path
You are not really enlightened if you are not able to code yourself.
Maybe we don't need to use something as powerful as GPT-7 to begin this journey.
We think that with significant human guidance, and by iterating many many times, we could meander iteratively towards a progressive deconstruction of GPT-5.
- Going from GPT-5 to GPT-2-hybrid seems possible to us.
- Improving GPT-2-hybrid to GPT-3-hybrid may be possible with the help of GPT-5?
- ...
If successful, this path could unlock the development of future AIs using constructability instead of deep learning. If constructability done right is more data efficient than deep learning, it could simply replace deep learning and become the dominant paradigm. This would be a much better endgame position for humans to control and develop future advanced AIs.
| Path | Feasibility | Safety |
|---|---|---|
| Comprehensive AI Services | Very feasible | Very safe but unstable in the very long run |
| Liberation | Feasible | Unsafe but could enable a pivotal act that makes things stable in the long run |
| Guided Meditation | Very hard | Fairly safe, and could unlock a safer tech than deep learning, resulting in a better end-game position for humanity |
You might be interested in reading this. I think you are reasoning in an incorrect framing.
I have tried Camille's in-person workshop in the past and was very happy with it. I highly recommend it. It helped me discover many unknown unknowns.
Deleted paragraph from the post, that might answer the question:
Surprisingly, the same study found that even if there were an escalation of warning shots that ended up killing 100k people or causing >$10 billion in damage (definition), skeptics would only update their estimate from 0.10% to 0.25% [1]: there is a lot of inertia, we are not even sure this kind of "strong" warning shot would happen, and I suspect this kind of big warning shot could happen beyond the point of no return, because this type of warning shot requires autonomous replication and adaptation abilities in the wild.
[1] It may be because they expect a strong public reaction. But even if there were a 10-year global pause, what would happen after the pause? This explanation does not convince me. Did governments prepare for the next Covid?
in your case, you felt the problem, until you decided that an AI civilization might spontaneously develop a spurious concept of phenomenal consciousness.
This is currently the best summary of the post.
Thanks for jumping in! And I'm not that emotionally struggling with this, this was more of a nice puzzle, so don't worry about it :)
I agree my reasoning is not clean in the last chapter.
To me, the epiphany was that AI would rediscover everything like it rediscovered chess alone. As I've said in the box, this is a strong blow to non-materialistic positions, and I've not emphasized this enough in the post.
I expect AI to be able to create "civilizations" (sort of) of its own in the future, with AI philosophers, etc.
Here is a snippet of my answer to Kaj, let me know what you think about it:
I'm quite confident that the meta-problem and the easy problems of consciousness will eventually be fully solved through advancements in AI and neuroscience. I've written extensively about AI and the path to autonomous AGI here, and I would ask people: "Yo, what do you think AI is not able to do? Creativity? Ok, do you know....". At the end of the day, I would aim to convince them that anything humans are able to do, we can reconstruct with AIs. I'd put my confidence level for this at around 95%. Once we reach that point, I agree it will become increasingly difficult to argue that the hard problem of consciousness is still unresolved, even if part of my intuition remains somewhat perplexed. Maintaining a belief in epiphenomenalism while all the "easy" problems have been solved is a tough position to defend - I'm about 90% confident of this.
Thank you for clarifying your perspective. I understand you're saying that you expect the experiment to resolve to "yes" 70% of the time, making you 70% eliminativist and 30% uncertain. You can't fully update your beliefs based on the hypothetical outcome of the experiment because there are still unknowns.
For myself, I'm quite confident that the meta-problem and the easy problems of consciousness will eventually be fully solved through advancements in AI and neuroscience. I've written extensively about AI and the path to autonomous AGI here, and I would ask people: "Yo, what do you think AI is not able to do? Creativity? Ok, do you know....". At the end of the day, I would aim to convince them that anything humans are able to do, we can reconstruct with AIs. I'd put my confidence level for this at around 95%. Once we reach that point, I agree it will become increasingly difficult to argue that the hard problem of consciousness is still unresolved, even if part of my intuition remains somewhat perplexed. Maintaining a belief in epiphenomenalism while all the "easy" problems have been solved is a tough position to defend - I'm about 90% confident of this.
So while I'm not a 100% committed eliminativist, I'm at around 90% (when I was at 40% in chapter 6 in the story). Yes, even after considering the ghost argument, there's still a small part of my thinking that leans towards Chalmers' view. However, the more progress we make in solving the easy and meta-problems through AI and neuroscience, the more untenable it seems to insist that the hard problem remains unaddressed.
a non-eliminativist might be perfectly willing to grant that yes, we can build the entire pyramid, while also holding that merely building the pyramid won't tell us anything about the hard problem nor the meta-problem.
I actually think a non-eliminativist would concede that building the whole pyramid does solve the meta-problem. That's the crucial aspect. If we can construct the entire pyramid, with the final piece being the ability to independently rediscover the hard problem in an experimental setup like the one I described in the post, then I believe even committed non-materialists would be at a loss and would need to substantially update their views.
Hmm, I don't understand something, but we are closer to the crux :)
You say:
- To the question, "Would you update if this experiment is conducted and is successful?" you answer, "Well, it's already my default assumption that something like this would happen".
- To the question, "Is it possible at all?" You answer 70%.
So you answer 99-ish% to the first question and 70% to the second question; this seems incoherent.
It seems to me that you don't bite the bullet for the first question if you expect this to happen. Saying, "Looks like I was right," seems to me like you are dodging the question.
That sounds like it would violate conservation of expected evidence:
Hmm, it seems there is something I don't understand; I don't think this violates the law.
I don't see how it does? It just suggests a possible approach by which the meta-problem could be solved in the future.
I agree I only gave a sketch of the proof; it seems to me that if you can build the pyramid, brick by brick, then this solves the meta-problem.
For example, when I give the example of the meta-cognition brick, I say that there is a paper that already implements this in an LLM (and I don't find this mysterious, because I know how I would approximately implement a database that would behave like this).
And it seems all the other bricks are "easily" implementable.
Let's put aside ethics for a minute.
"But it wouldn't be necessary the same as in a human brain."
Yes, this wouldn't be the same as the human brain; it would be like the Swiss cheese pyramid that I described in the post.
Your story ended on stating the meta problem, so until it's actually solved, you can't explain everything.
Take a look at my answer to Kaj Sotala and tell me what you think.
Thank you for the kind words!
Saying that we'll figure out an answer in the future when we have better data isn't actually giving an answer now.
Okay, fair enough, but I predict this would happen: in the same way that AlphaZero rediscovered all of chess theory, it seems to me that if you just let the AIs grow, you can create a civilization of AIs. Those AIs would have to create some form of language or communication, and some AI philosopher would get involved and then talk about the hard problem.
I'm curious how you answer those two questions:
- Let's say we implement this simulation in 10 years and everything works the way I'm telling you now. Would you update?
- What is the probability that this simulation is possible at all?
If you expect to update in the future, just update now.
To me, this thought experiment solves the meta-problem and so dissolves the hard problem.
But I have no way to know or predict if it is like something to be a fish or GPT-4
But I can predict what you say; I can predict whether you are confused by the hard problem just by looking at your neural activations; I can predict, word by word, the next sentence you are going to utter: "The hard problem is really hard."
I would be curious to know what you think about the box solving the meta-problem just before the addendum. Do you think it is unlikely that AI would rediscover the hard problem in this setting?
I'm not saying that LeCun's rosy views on AI safety stem solely from his philosophy of mind, but yes, I suspect there is something there.
It seems to me that when he says things like "LLMs don't display true understanding" or "true reasoning", as if there's some secret sauce to all this that he thinks can only appear in his Jepa architecture or whatever, this is very similar to the linguistic problems I've observed for consciousness.
Surely, if you discussed this with him, he would say things like "No, this is not just a linguistic debate, LLMs cannot reason at all, my cat reasons better" - which surely indicates a linguistic debate.
It seems to me that LeCun is basically an essentialist about his Jepa architecture, treating it as the main criterion for a neural network to exhibit "reasoning".
LeCun's algorithm is something like: "Jepa + Not LLM -> Reasoning".
My algorithm is more something like: "chain-of-thought + can solve complex problem + many other things -> reasoning".
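To make the contrast concrete, here is a toy illustration (purely my framing, with made-up feature names): both "algorithms" are just different predicates over observable features of a system, which is why the disagreement reads as a definitional dispute rather than an empirical one.

```python
# Toy illustration (my framing, not LeCun's actual words): two different
# predicates over the same observable features, yielding different labels.

def lecun_reasoning(system: dict) -> bool:
    return system.get("architecture") == "Jepa" and not system.get("is_llm", False)

def my_reasoning(system: dict) -> bool:
    # ... plus many other behavioral criteria in the full version
    return system.get("uses_chain_of_thought", False) and system.get("solves_complex_problems", False)

gpt4 = {
    "architecture": "transformer",
    "is_llm": True,
    "uses_chain_of_thought": True,
    "solves_complex_problems": True,
}
print(lecun_reasoning(gpt4), my_reasoning(gpt4))  # False True: same facts, different labels
```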
This is very similar to the story I tell for consciousness in the Car Circuit section here.
Sure, "everything is a cluster" or "everything is a list" is as right as "everything is emergent". But what's the actual justification for pruning that neuron? You can prune everything like that.
The justification for pruning this neuron seems to me to be that if you can explain basically everything without using a dualistic view, it is so much simpler. The two hypotheses are possible, but you want to go with the simpler hypothesis, and a world with only (physical properties) is simpler than a world with (physical properties + mental properties).
I would be curious to know what you think about my box trying to solve the meta-problem.
Do you mean that the original argument that uses zombies leads only to epiphenomenalism, or that if zombies were real that would mean consciousness is epiphenomenal, or what?
Both
I don't know, it depends on your definition of "unsolved" and "solved", but I would lean towards "there is a solved hard problem" because the problem was hard, it took me a lot of time (i.e. the meme of the hard problem existed in my head), and my post finally dissolved the question.
When there are difficult decisions to be made, I like to come back to this story.
TLDR of the post: The solution against misuse is for labs to be careful and keep the model behind an API. Probably not that complicated.
My analysis: Safety culture is probably the only bottleneck.
Thank you for writing this, Alexandre. I am very happy that this is now public, and some paragraphs in part 2 are really nice gems.
I think parts 1 and 2 are a must-read for anyone who wants to work on alignment, and they articulate dynamics that I think are extremely important.
Parts 3-4-5, which focus more on Conjecture, are more optional in my opinion, and could have been another post, but are still interesting. This has changed my opinion of Conjecture and I see much more coherence in their agenda. My previous understanding of Conjecture's plan was mostly focused on their technical agenda, CoEm, as presented in a section here. However, I was missing the big picture. This is much better.
I think this is very important, and this goes directly into the textbook from the future.
it seems to me that if we are careful enough and if we implement a lot of the things that are outlined in this agenda, we probably won't die. But we need to do it.
That is why safety culture is now the bottleneck: if we are careful, we are an order of magnitude safer than if we are not. And I believe this is the most important variable.
No, we were just brainstorming internship projects from first principles :)
Why wasn't this crossposted to the alignment forum?
It's better than the first explanation I tried to give you last year.
I don't think it makes sense to "revert to a uniform prior" over {doom, not doom} here.
I'm not using a uniform prior; the 50/50 thing is just me expressing my views, all things considered.
I'm using a decomposition of the type:
- Does it want to harm us? Yes, because of misuse, ChaosGPT, wars, psychopaths shooting up schools, etc...
- Can it harm us? This is really hard to tell.
I strongly disagree that AGI is "more dangerous" than nukes; I think this equivocates over different meanings of the term "dangerous," and in general is a pretty unhelpful comparison.
Okay. Let's be more precise: "An AGI that has the power to launch nukes is at least more powerful than nukes." Okay, and now, how would an AGI acquire this power? That doesn't seem that hard in the present world. You can bribe/threaten leaders, use drones to kill a leader during a public visit, and then help someone gain power and become your puppet during the period of confusion, à la conquistadors. The game of thrones is complex and brittle; this list of coups is rather long, and the default for a civilization/family reigning over some kingdom is to be overthrown.
I prefer to stick fairly close to the probabilities I get from induction over human history, which tell me p(doom from unilateral action) << 50%
I don't like the word "doom". I prefer the expression 'irreversibly messed up future', inspired by Christiano's framing (and because of anthropic arguments, it's meaningless to look at past doom events to compute this probability).
I'm really not sure what the reference class should be here. Yes, you are still alive and human civilization is still here, but:
- Napoleon and Hitler are examples of unilateral actions that led to international wars.
- If you go from unilateral action to multilateral actions, and you allow stuff like collusion, things become easier. And collusion is not that wild; we already see this in Cicero: the AI, playing as France, conspired with Germany to trick England.
- As the saying goes: "AI is a wonderful tool for the betterment of humanity; AGI is a potential successor species." So maybe the reference class is more something like chimps, Neanderthals, or horses. Another reference class could be something like slave rebellions.
I find foom pretty ludicrous, and I don't see a reason to privilege the hypothesis much.
We don't need the strict MIRI-like RSI foom to get into trouble. I'm saying that if AI technology does not have the time to percolate through the economy, we won't have the time to upgrade our infrastructure and add much more defense than we have today, which seems to be the default.
It's not strong evidence; it's a big mess, and it seems really difficult to have any kind of confidence in such a fast-changing world. It feels to me like it's going to be a roughly 50/50 bet. Saying the probability is 1% requires much more work that I'm not seeing, even if I appreciate the work you are putting in.
On the offense-defense balance, there is no clear winner in the comment sections here, nor here. We've already seen a takeover between two roughly equal human civilizations (see the story of the conquistadors) under certain circumstances. And AGI is at least more dangerous than nuclear weapons, and we came pretty close to nuclear war several times. Covid seems to have come from gain-of-function research, etc...
On fast vs. slow takeoff, it seems to me that fast takeoff breaks a lot of your assumptions, and I would assign much more than a 1% probability to fast takeoff. Even if you embrace the compute-centric framework (which I find conservative), you still get wild numbers, like a two-digit probability of a takeoff lasting less than a year. If so, we won't have the time to implement defense strategies.
What about Section 1?
Metaculus is now at 2031. I wonder how the folks at OpenPhil have updated since then.
Ah, no, this is a mistake. Thanks.
Here is the summary from ML Safety Newsletter #11, Top Safety Papers of 2023.
Instrumental Convergence? questions the argument that a rational agent, regardless of its terminal goal, will seek power and dominance. While there are instrumental incentives to seek power, it is not always instrumentally rational to seek it. For instance, while there are incentives to become a billionaire, it is not necessarily rational for everyone to try to become one. Moreover, in multi-agent settings, AIs that seek dominance over others would likely be counteracted by other AIs, making it often irrational to pursue dominance. Pursuing and maintaining power is costly, and simpler actions are often more rational. Last, agents can be trained to be power averse, as explored in the MACHIAVELLI paper.
"Moreover, in multi-agent settings, AIs that seek dominance over others would likely be counteracted by other AIs" --> I've not read the whole thing, but what about collusion?
"Last, agents can be trained to be power averse, as explored in the MACHIAVELLI paper." --> This is pretty weak imho. Yes, you can fine tune GPTs to be honest, etc, or you can filter the dataset with only honest stuff, but at the end of the day, Cicero still learned strategic deception, and there is no principled way to avoid that completely.
"I can't really see a reasonable way for the symmetry to be implemented at all."
Yeah, same.
Here's an example, although it is not reasonable.
You could implement the embeddings in a vector database. If X1 and X2 are equivalent, embed them with an anti-collinear relationship, i.e. X1 = -X2, and implement the 'is' operator as multiplication by -1.
But this fails when there are three vectors that should be equivalent, and it is not very elegant to embed items that should be "equivalent" with an anti-collinear relationship.
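A toy illustration (with made-up numbers) of why the three-vector case breaks:

```python
import numpy as np

# Toy sketch of the anti-collinear trick and why it breaks with three
# mutually "equivalent" items.

x1 = np.array([1.0, 2.0, -0.5])
x2 = -x1                       # X1 and X2 are declared equivalent: X2 = -X1

def is_op(v: np.ndarray) -> np.ndarray:
    # The 'is' operator, implemented as multiplication by -1.
    return -v

print(np.allclose(is_op(x1), x2))  # True: the symmetry holds for a pair

# But for a third equivalent item X3 we would need X3 = -X1 and X3 = -X2 = X1,
# i.e. X1 = -X1, which forces X1 = 0. So a single "multiply by -1" cannot
# represent a three-way equivalence class.
x3_from_x1 = -x1
x3_from_x2 = -x2
print(np.allclose(x3_from_x1, x3_from_x2))  # False: the two constraints contradict
```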