"I think the Fall is not true historically".
While all men must die and all civilizations must collapse, the end of all things is merely the counterpart of the beginning of all things. Creation, the birth of men, and the rise of civilizations are also great patterns and memorable events, both in myth and in history. Yet feeling does not respect this symmetry: perhaps because of loss aversion and the peak-end rule, the Fall, and tragedy in general, carries a uniquely strong poetic resonance. Fatum represents the story's inevitable conclusion. There is something epic in the Fall, something existential, even more than in the beginning of things. I believe there is something deeply rooted, hardwired, in most of us that makes this so. Perhaps it is tied to our consciousness of finitude and our fear of the future, of death. Even if it represents a traditional and biased interpretation of history, I cannot help but feel moved. Tolkien has an unmatched ability to evoke and magnify this feeling, especially in The Silmarillion and other unfinished works; I think naturally of the Fall of Valinor and the Fall of Gondolin, among other things.
Indeed, nature, and particularly biology, disregards our human considerations of fairness. The lottery of birth can appear as the greatest conceivable inequality. But in this matter, one must apply the Stoic doctrine that distinguishes between what depends on us and what does not. Morality concerns what depends on us, the choices that belong to the moral agents we are.
If I present the lottery of birth in an egalitarian light, it is specifically in the sense that we, as humans, have little control over this lottery. Particularly regarding IQ at birth: regardless of our wealth, we have all, until now, been almost on an equal footing in our inability to considerably influence this biological fact imposed upon us (I discussed in my previous comments the differences I see between the author's proposal and education, and likewise between it and conventional medicine).
If the author's project succeeds, IQ will become a mainly socially determined fact, like wealth. Inequality in wealth would then be accompanied by inequality in IQ, proportional or even exponential (if feedback mechanisms occur, considering that a higher IQ might enable a wealthy individual to become even wealthier and thus access the latest innovations for further enhancement).
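To make the feedback idea concrete, here is a deliberately crude toy model (my own illustration, with invented parameters, nothing from the author's post) of how a wealth-to-IQ-to-wealth loop could turn an initial wealth gap into a runaway divergence:

```python
# Toy model (illustrative only): wealth buys IQ enhancement, and the
# enhanced IQ in turn raises earning power, compounding the gap.
# Both coupling parameters are invented for illustration.
def simulate(wealth, iq, years=30, iq_per_dollar=1e-6, return_per_iq=0.002):
    for _ in range(years):
        iq += iq_per_dollar * wealth       # enhancement purchased with wealth
        wealth *= 1 + return_per_iq * iq   # higher IQ -> higher returns on wealth
    return wealth, iq

rich = simulate(wealth=1e7, iq=100)
poor = simulate(wealth=1e4, iq=100)
print(rich, poor)  # the wealth and IQ gaps both widen year after year
```

With any positive coupling in both directions, the gap compounds instead of stabilizing, which is what I mean by "exponential".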
We already struggle to establish social mechanisms to redistribute wealth and limit the growth of inequalities; I can hardly imagine what would happen if we also had to address, within a short time, inequalities in access to IQ-enhancing technologies. I fear that all this could lead to a chaotic or dystopian scenario, possibly resulting in a partition of the human species and/or a civilizational collapse.
As for ensuring that this type of genetic engineering technology does not result in such a catastrophic outcome, I do not claim to have a miracle solution. As with other existential risks, what can be suggested is to slow down the trend (which is likely inevitable in the long term) instead of seeking to accelerate it; to think as much as possible in advance; to raise awareness of the risks in order to enable collective recognition of these issues (which is what I am trying to do here); and to hope that, with more time and this proactive reflection, the transition will proceed more smoothly, that international treaties will emerge, and that state mechanisms will gradually be put in place to counter or mitigate this unprecedented source of inequality.
Yes, of course. Despite its stochastic nature, it is extraordinarily unlikely for an advanced LLM to respond with anything other than 2 + 2 = 4 or Paris for the capital of France. A stochastic phenomenon can, in practice, tend toward deterministic behavior. However, deception in a context such as the one discussed in Apollo Research's article is not really comparable to answering 2 + 2 = ?. What the article demonstrates is that we are dealing with tendencies, accompanied by considerable randomness, including in the intensity of the deception.
Assuming a more sophisticated model has roughly double the deception capability of model o1, it would be enough to increase the sample size of responses for the anomaly to become glaringly obvious. One could also imagine a more rigorous test involving even more complex situations. It does not seem inconceivable that such a procedure could, for years to come—and perhaps even at the stage of the first generations of AGI—identify deceptive behaviors and establish an RL procedure based on this test.
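A minimal sketch of such a test, assuming a hypothetical `run_deception_scenario` judge (not an existing API) that labels a single sampled response as deceptive or not:

```python
from scipy.stats import binomtest

# Hypothetical evaluator: returns True when one sampled response is judged
# deceptive in a scripted scenario (e.g., by a rule-based or LLM judge).
def run_deception_scenario(model) -> bool:
    ...

def deception_detected(model, n_samples=10_000, baseline_rate=0.001, alpha=1e-6):
    """Flag the model if its deception rate is significantly above a baseline.

    Because sampling is stochastic, a tendency invisible in any single
    response becomes glaring across a large enough sample.
    """
    hits = sum(run_deception_scenario(model) for _ in range(n_samples))
    result = binomtest(hits, n_samples, baseline_rate, alternative="greater")
    return result.pvalue < alpha
```

Doubling the underlying tendency roughly doubles the expected hit count, so a stronger deceiver is easier, not harder, to flag at a given sample size.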
Thank you for this publication. Just an idea (maybe stupid): GPT-type models are stochastic by design. This characteristic might be used as a way to control them. Your study with Apollo Research illustrates this well—there is a certain probability that a given model will deceive users under specific circumstances and to varying degrees. However, this remains fundamentally random, non-deterministic, and that property is quite interesting. There is always a chance that deception does not occur, is clumsy, or stops at an early stage.
For a single response, it is impossible to determine whether deception is present or not. But across a large set of responses, it seems almost impossible for a model, due to its very design, to maintain a solid and coherent deceptive behavior across multiple parallel outputs. Thus, a study like yours serves as a very effective tool to detect such behaviors and could remain effective for increasingly advanced models. It would simply require enlarging (perhaps significantly) the response sample.
Moreover, this could be implemented as an automated testing framework inspired by your paper. This would make it possible to continue detecting deceptive behaviors with arbitrary effectiveness, as long as the model remains stochastic. Once such behaviors are detected and measured using tests of this kind, an automated RL phase could follow, aiming to reduce the tendency toward deception to very low levels.
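In pseudocode, the loop I have in mind might look like the following (every function name here is a placeholder for machinery that would have to be built, not an existing library):

```python
# Placeholder sketch of the detect-then-train loop described above.
def reduce_deception(model, scenarios, target_rate=1e-4):
    while True:
        # 1. Measure: large-sample estimate of the deception tendency,
        #    exploiting the model's stochasticity.
        rate = estimate_deception_rate(model, scenarios, n_samples=50_000)
        if rate <= target_rate:
            return model
        # 2. Train: an RL phase penalizing transcripts flagged as deceptive.
        flagged = collect_flagged_transcripts(model, scenarios)
        model = rl_finetune(model, negative_examples=flagged)
        # 3. Harden: enlarge the scenario set so the model cannot simply
        #    memorize and evade the known tests.
        scenarios = augment_scenarios(scenarios)
```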
You are right. When I wrote my initial comment, I believed the argument was self-evident and did not require elaboration. However, "self-evidence" is not an objective concept, and I likely do not share the same socio-cultural environment as most users of this platform. Upon reading your comment and Ben Pace's, I realize that this apparent self-evidence is far from universally shared and requires further explanation. I have already expanded on my argument in my previous response, but here are the specific reasons why I think the author's project (and indeed the transhumanist project of enhancing human beings) raises unprecedented issues in terms of increasing inequality, more so than most technological innovations such as running water or mobile phones.
First, John Rawls's veil of ignorance constitutes a strong philosophical and rational argument for considering excessive inequalities as unjust and morally condemnable (edit: this is not my personal claim but that of John Rawls, with whom I fully agree). The veil of ignorance aligns utilitarianism with Kantian morality, as it invites the moral agent to step outside their specific case and evaluate the morality of a situation from a more universal, distanced, and objective perspective. While utilitarianism and effective altruism encourage giving to fund actions aimed at reducing suffering and increasing happiness, this can also be seen, in part, as a voluntary redistribution of wealth to correct excessive inequalities, which are unjust and a cause of suffering (since the feeling of injustice itself constitutes a form of suffering). In most countries, to varying degrees, states also impose redistribution through taxation and various social mechanisms to combat excessive inequalities. Nevertheless, global inequalities continue to grow and remain a very serious concern.
Technological innovations fit into the problem of inequality insofar as it is generally the wealthiest who benefit from them, or at least benefit first. However, I do not dispute the argument made by liberal economists that the costs of technological innovations tend to decrease over time due to the amortization of R&D investments, the profitability of patents, mass production, and economies of scale, eventually benefiting everyone. Still, this is an empirical observation that must be nuanced. Not all technological innovations have followed the same trajectory; scenarios vary widely.
The oldest technological inventions (mastery of fire, stone tools, spears, bows, etc.) emerged in non-storing hunter-gatherer societies. In the absence of wealth accumulation, these were likely relatively egalitarian societies (cf. the works of Alain Testart on this subject). For a long time, technological innovations, which were rare, could benefit the entire population within a given culture. This may seem anecdotal and almost digressive, but we are talking about hundreds of thousands of years, which represent the overwhelming majority of human history.
Then, consider an emblematic example of a highly valuable technological innovation: access to potable water. It began appearing in the Roman Empire about 2,000 years ago but took a very long time to reach poorer populations. Even today, about a quarter of humanity—2 billion people out of 8 billion—still lacks access to this technology, which we might consider essential.
By contrast, mobile phones, although they could be seen as gadgets compared to the previous example, have spread like wildfire in just a few decades and are now almost as present in the global population as potable water. These two examples illustrate that the time it takes for a technology to spread can vary dramatically, and this is not neutral regarding inequality. Waiting 30 years versus 2,000 years for a technology to benefit the less wealthy is far from equivalent.
Another nuance to consider is whether significant qualitative differences persist during the spread of an invention. Potable water tends to vary little whether one is rich or poor. Mobile phones differ somewhat more. Personal automobiles even more so, with a significant portion of the population accessing them only through collective services, despite this invention being over a century old. As for airplanes, the wealthiest enjoy luxurious private jets, while those slightly less wealthy can only access collective flights—and a large part, perhaps the majority of the world's population, has no access to this technology at all, more than a century after its invention. This is an example worth keeping in mind.
Moreover, not all innovations are equal. While mobile phones might seem like gadgets compared to potable water, food and health are vital, and technological innovations with a significant impact in these areas are of great value. This was true of the mastery of fire for heating and cooking, tools for hunting and defense, techniques for producing clothing, construction methods for shelters, and, more recently, potable water, hot water, and eventually medicine, which, while it does not make humans immortal (yet!), at least prolongs life and alleviates physical disabilities and suffering. Excessive wealth inequalities create excessive inequalities in access to medicine. This is precisely why many countries have long implemented countermeasures against such inequalities, to the point that in some countries, like France or Sweden, there exists a nearly perfect equality of access to healthcare through social security systems. In the United States, Obama-era legislation (Obamacare) also aimed to reduce these inequalities.
The innovation proposed by the author—enhancing the intelligence of adult individuals by several dozen or even hundreds of IQ points—would constitute an extremely impactful innovation. The anticipated IQ difference would be comparable to the gap separating Homo sapiens from Neanderthals or even Homo erectus (impossible to quantify precisely, but paleoanthropologists suspect the existence of genetic mutations related to neural connectivity that might have given Sapiens an advantage over Neanderthals; as for Erectus, we know its encephalization quotient was lower). Let’s be honest—if we were suddenly thrust into Rawls's veil of ignorance, we would tremble at the idea of potentially awakening as a poor Erectus condemned to remain an Erectus, while an elite group of peers might benefit from an upgrade to Sapiens status. Yes, this is indeed a terrifying inequality.
Unlike an expensive treatment that addresses only a few patients, in this case, 100% of the population would have a tremendous interest in benefiting from this innovation. It is difficult to imagine a mechanism equivalent to social security or insurance here. Everyone would likely have to pay out of pocket. Furthermore, it is clear that the technology would initially be very expensive, and the author himself targets an elite as the first beneficiaries. The author envisions a scientific elite responsible for addressing AI alignment issues, which is appealing to some readers of this forum who may feel concerned. However, in reality, let’s not be deceived: as with space tourism, the first clients would primarily be the extremely wealthy (though AI experts themselves are relatively affluent).
How many generations would it take for everyone to benefit from such technology? In 2,000 years, potable water is still not universally accessible. Over a century after its invention, private airplanes remain a luxury for a tiny minority on Earth—a situation unchanged for decades. A century exceeds the life expectancy of a human in a developed country. As Keynes said, “In the long run, we are all dead.” The horizon for a human is their lifetime. Ultimately, only in the case of a rapid diffusion, like mobile phones, would inequality be less of a concern. Personally, however, I would bet more on the private jet scenario, simply because the starting cost would likely be enormous, as is the case for most cutting-edge therapies.
Even in the ideal—or, let’s be honest, utopian—scenario where the entire global population could benefit from this intelligence upgrade within 30 years, this innovation would still be unprecedented in human history. For the first time, wealthy individuals could pay to become intellectually superior. The prospect is quite frightening and resembles a dystopian science fiction scenario. Until now, money could buy many things, but humans remained equal before the biological lottery of birth, particularly regarding intellect. For the first time, this form of equality before chance would be abolished, and economic inequality would be compounded by intellectual inequality.
Admittedly, some might argue that education already constitutes a form of intellectual inequality linked to wealth. Nevertheless, the connection between IQ and education is not as immediate, and it also depends on the efforts and talents of the student (and of the teachers). Moreover, several countries worldwide have very egalitarian education systems (notably many in Europe). Here, we are talking about intelligence enhancement through a pill or injection, which is an entirely different matter. As advantages stack up, inequalities become extreme, raising not only ethical or moral questions but also concerns about societal cohesion and stability. Even over 30 years, a major conflict between “superhumans” and “subhumans” is conceivable. The former might seek to dominate the latter—or dominate them further, if one considers that excessive economic inequality already constitutes a form of domination. Alternatively, the latter might rebel out of fear or seek to eliminate the former. Edit: most of the literature on collapsology identifies excessive social inequalities as a recurring factor in societal collapse (for instance, Jared Diamond, Amin Maalouf, etc.).
This risk seems all the more significant because the idea of modifying human beings is likely to be rejected by many religions (notably the Catholic Church, which remains conservative on bioethical matters, though other religions are no more open-minded). Religion could act as a barrier to the global adoption of such technology, making the prospect of rapid diffusion even less plausible and the risk of societal instability or fracture all the greater. It is important to remember that technologies do not automatically spread; there may be cultural resistance or outright rejection (a point also studied in detail by Alain Testart, particularly concerning the Australian Aborigines).
In conclusion, I believe the hypothesis of a partitioning of humanity—whether social or even biological (a speciation scenario akin to Asimov’s Terrans and Spacers)—is a hypothesis to be taken very seriously with this type of project. A convinced transhumanist might see this as a positive prospect, but in my view, great caution is warranted. As with AGI, it is essential to think twice before rushing headlong into such ventures. History has too often seen bold projects end in bloodshed. I believe that if humanity is to be augmented, it must ensure that the majority are not left behind.
Edit: for some reason I don't understand, I can't add links to this comment as I intended.
Thank you for your kind advice (I made some edits to my previous comment accordingly). I must have expressed myself poorly, because I am in no way questioning the idea that science and technology have greatly contributed to improving the condition of humanity. My remark was about inequality. Scientific development is not inherently linked to the increase in inequalities. On the contrary, many scientific and technological advances are likely to benefit everyone. For instance, in many countries, both the rich and the poor use the same good-quality tap water. The same is true for many digital devices (a poor person can have a cellphone not that different from the one a rich person possesses). Even the poor populations of underdeveloped countries benefit, to some degree, from these advances. There are fewer food shortages and there is better healthcare even in these countries, although much remains to be done.
However, on this subject I stand by my arguments, reformulated as follows:
- too great an inequality is a major source of suffering and social instability (many revolutions have come from it);
- contrary to tap water and cellphones, the author's project risks increasing inequalities in a way never seen in history (the difference between a "happy few" of the rich with artificially increased IQs and the standard layman would be like the difference between a Sapiens and a Neanderthal, or maybe an Erectus), with the prospect of a partition of humankind, or even speciation, within a short time.
I must say that I am surprised to read that it is common knowledge on this forum that there is no problem with inequalities. If that is so, I still really disagree on this point, at the risk of being disregarded. Too much inequality is definitely a concern. It is maybe not a big deal for the rich minority (as long as they are not overthrown; Marie-Antoinette had some trouble), but it is for the poor majority. I could rely on the international publications of the United Nations (https://www.un.org/fr/un75/inequality-bridging-divide) and on numerous authors, such as Amartya Sen (Nobel Prize in Economics) or Thomas Piketty, who are particularly engaged with this issue (and of course Marx in older times, but don't tag me as a Marxist, please). I also recommend reading James C. Scott's Against the Grain: A Deep History of the Earliest States, which shows how early civilizations based on inequality have historically been fragile and subject to brutal collapse (inequality being one main factor, epidemics another).
Edit: I would add that denying that inequality is a concern amounts to dismissing most of the work of the experts on the subject, that is to say, researchers in the social sciences; an inclination that may reflect a bias possibly common among "hard" scientists.
It is the great paradox of this forum to serve both as a platform where individuals aligned with the effective altruism movement raise alarms and invite deep reflection on existential risks—usually with a nuanced, subtle, and cautious approach to these topics—and as a space where individuals adhering to libertarian or transhumanist ideologies also promote rather radical ideas, which themselves might constitute new existential challenges for humanity. I say this not with the intent of disparagement but rather as an observation.
This topic is a fascinating example of this paradoxical mixture. On one hand, the author seems to appeal to fears surrounding the "superintelligence" of artificial intelligence to, in a way, justify the development of a "superintelligence" in humans, or at least a significant enhancement of intelligence for a select group of individuals—an elite of scientists somewhat reminiscent of the Manhattan Project, aimed at solving the alignment problem. These would be heroes of humanity, capable of mastering the hydra of AI in a titanic intellectual struggle.
In reality, I think everyone here is aware that this argument is primarily rhetorical, as the risk associated with AI lies in the loss of control and the exponential growth of AI's capabilities in the medium or short term, far outpacing the possibilities for enhancing human cognitive abilities within the same timeframe. Moreover, this argument appears mainly in the introduction and does not seem to be the central focus thereafter. To me, this argument serves as an introductory "hook" to delve into the technical discussion.
Indeed, the article quickly and almost exclusively shifts focus to the feasibility aspect. The author demonstrates, with substantial evidence, that there is fundamentally no barrier—given current genomic editing techniques and the latest acquired knowledge—to achieving this goal. Broadly speaking, what can be taken away from the article is that the main obstacle today is strong institutional reluctance for moral or ethical reasons, which the author dismisses out of hand, without attempting to understand or discuss them in detail. For him, there seems to be a bias or taboo on the matter—necessarily conservative, irrational, and detrimental.
However, this is, in my view, the major blind spot of this article. Following the author's reasoning, if we can do it, why not do it without delay? In reality, this reasoning mirrors those seeking AGI, a subject of much debate on this very forum. Regarding AI, a frequently cited argument is that if we do not develop it, others will, and the advantage will go to the first, as is often the case with technological innovation. This argument could also be applied here. However, just because something is possible does not mean it is desirable. I could jump off a cliff, but upon reflection, I think I’ll refrain. Transhumanism—and notably the development of technical means to enhance intelligence as proposed by the author—must provoke the same kind of questioning as the development of AGI.
First, one must ask whether it is desirable, carefully weighing the pros and cons, the reasonably foreseeable advantages and consequences, and the measures that could be taken to improve the benefit/risk ratio, with particular attention paid to limiting risks (precautionary principle or simply prudence). The pros are relatively straightforward, but the cons may need more elaboration.
The first area of concern is socio-economic. Even if the proposed technique allowed intelligence to be increased without harming health, it would be costly and would benefit an elite. The author sells us the idea of a scientific elite. But it is entirely predictable that if such a technique were developed, nearly all the world's wealthy would rush to pay fortunes to be among the first to benefit. What are the chances that populations in underdeveloped countries, who currently lack access to education, proper nutrition, and clean water, would ever benefit from this technique? Virtually none, or not for many generations. Furthermore, large segments of humanity would refuse it for religious reasons. The first predictable socio-economic effect would be an increase in inequalities to a level unprecedented in human history. (Edit: an esteemed member warned me that I could lose typical readers of LessWrong on this argument, so I developed a clearer and more detailed argumentation here and here; the idea is that, contrary to previous technological advances like tap water and cellphones, the author's project could increase inequalities in a way never seen in history, because the difference between a "happy few" of the rich with artificially increased IQs and the standard layman would be like the difference between a Sapiens and a Neanderthal, or maybe an Erectus.)
The second concern, directly linked to the first, is philosophical and moral—or "ethical," as the term is used to avoid sounding religious. Many here are interested in effective altruism, yet we must not forget that inequality is a major source of suffering, tension, and instability in both present and past human societies (edit: remember that many or most revolutions came from that). Altruism is hardly compatible with inequality. (Edit: I mean "excessive inequalities". Utilitarianism and effective altruism strongly encourage donation and redistribution, counter-measures against inequality, which shows that it is a main concern.) One might even envision that the development of such human enhancement technologies could rapidly lead to a form of speciation, as envisioned by Asimov in his Robots series (the Spacers and later the Solarians). (Edit: remember what I said concerning a difference in IQ as great as the difference between Sapiens and Neanderthal or Erectus.) Of course, if one is among the "happy few" precursors of a new humanity, this could seem appealing at first glance. But is it moral—that is, does it align with the goal of a reasonable maximization of happiness on a global human scale? One may doubt it. A partition of humanity challenges the very idea of humanity. Moreover, history shows that nearly all movements guided by elitist ideology have gone very wrong, leading to the worst discriminations and the greatest tragedies—even for the elitist side (for what it's worth, Asimov's Spacers and Solarians also meet grim ends). If the author often encounters accusations invoking Hitler in academic circles, it might not be due to a bias on the part of these educated individuals, but rather a form of wisdom stemming from their education, culture, and personal reflection on such matters. The author completely overlooks the countless publications, conferences, and ethical committee deliberations on these issues. All this philosophical and ethical reflection is no less valuable or intellectually significant than the genomic research underpinning the article. As with AGI, the question is: is it desirable? It seems reasonable to think this through seriously before rushing to make it happen.
The third area of concern, and not the least, is medical and biological. The author—though seemingly very knowledgeable in the field—admits to not being a trained biologist and expresses surprise that professionals in the domain tend to downplay the role of genes. Yet the author also seems to dismiss the general consensus of those professionals with a wave of the hand. I am not a biologist by training either, but it is well known today that the relationship between phenotype and genotype is complex. The fact that in some specific cases there is a relatively straightforward link (e.g., monogenic diseases, blue eyes, etc.) should not obscure the forest of complexity that applies to the vast majority of other cases.
It is now understood that gene expression is regulated by other genes within the coding portion of the genome, but also by less-studied sequences within the vast non-coding majority, once dismissed as "junk DNA." Additionally, epigenetic mechanisms play a role, involving interactions between the nucleus and its immediate environment (the cell), its less immediate environment (the organism), and even the broader external environment.
To make matters more complex, the author's subject concerns intelligence. First, we must agree on what "intelligence" means. The author is clear in taking IQ as a reference, as it is a relatively objective indicator, but one could argue that there are many other forms of intelligence not captured by IQ, such as social intelligence (see Gwern's very interesting comment on this topic). Moreover, it is generally acknowledged by specialists in the field (e.g., Stanislas Dehaene) that the relationship between genetics and intelligence must be approached with caution. Intelligence has a highly diffuse and polygenic genetic basis (as the article itself does not dispute), and it is largely shaped by learning, i.e., education and broader interaction with the environment—something the article appears to give less weight to.
That said, it is difficult to contest the author's point that genetics does play a role in intelligence and that certain gene combinations may predispose individuals to higher IQs. However, the author focuses entirely on this optimization. Yet natural selection is an optimization process that has been unfolding over approximately 4 billion years, representing an astronomically large computational cost (see Charles H. Bennett's concept of logical depth). This optimization is by no means wholly directed toward the goal of increasing IQ—far from it. Instead, it involves countless competing constraints, resulting in a series of trade-offs.
For instance, developing a larger brain significantly increases energy demands, as the brain is one of the most energy-intensive organs. Paleoanthropology shows that, as a trade-off for brain development, there has been a proportional reduction in muscle mass, digestive system size (linked to mastery of fire, cooking, and a more carnivorous diet), and an increase in adipose tissue (which consumes little energy and serves as storage). In short, we cannot have it all: we are naturally intelligent but also weak, fat, and have more limited digestive capacities. These trade-offs are found everywhere, even in the smallest details.
For example, the author mentions Alzheimer's as a disease that could potentially be treated through genomic editing. Wonderful. But recent studies show that carriers of the APOE ε4 allele, implicated in Alzheimer's disease, exhibit superior cognitive performance in certain tasks, although results vary among individuals and contexts (https://doi.org/10.1007/s10519-019-09961-y). Similarly, findings suggest that APOE4 may improve neuronal energy functions, which could be beneficial during brain development (https://doi.org/10.1101/2024.06.03.597106). Editing out the APOE ε4 allele could therefore work against the author's original goal of increasing IQ, because we are facing a trade-off. The example is striking.
Contrary to the author's implications, it is highly unlikely that these genomic edits would come without negative trade-offs, potentially with harmful effects on health or lifespan. Given the way the genome has been shaped—through optimization via accumulated trade-offs across vast spans of time—it seems very likely, almost inevitable, that most of these edits would increase IQ at the expense of other capacities, potentially ones that are difficult to identify initially.
Sometimes the advantages or disadvantages of a gene are only revealed under specific conditions. For instance, certain genes inherited from Neanderthals through hybridization have been found to predispose individuals to greater vulnerability to COVID-19. However, the fact that these mutations were selected for and preserved over 50,000 years indicates they must have had advantages (some speculate they provided adaptations to cold environments, which Neanderthals developed in Europe and which Sapiens, originating from warmer regions, may have benefited from "acquiring"). Similarly, genes predisposing to obesity were, until recently, advantageous for surviving food shortages (and no one knows what the future may hold; maybe AI or superior humans will deprive me of my snacks!).
In conclusion, the idea is interesting but would require extensive prior research before rushing into it headlong. This is not about outright rejection on principle or blind acceptance. As with AI, there is an urgent need to slow down and reflect. After all, isn't humanity defined as a thinking animal? It would be ironic if brilliant individuals pursuing higher forms of intelligence themselves displayed insufficient reflection in their approach. (End edited to be less polemic).
A new existential risk that I was unaware of. Reading this forum is not good for peaceful sleep. Anyway, a reflection came to me. LUCA lived around 4 billion years ago, with its chirality chosen at random. But no doubt many things happened before LUCA, and it is reasonable to assume that there was initially a competition between right-handed protobiotic structures and left-handed ones, until a mutation, through natural selection, broke the symmetry. The mirrored lineage lost the competition and went extinct; end of story. But wait: we are talking about protobiotic structures that emerged from inert molecules in just a few million years, which is nothing compared to 4 billion years. Such protobiotic structures may have formed continuously, again and again, since the origin of life, yet never thrived because of the competition with regular, fine-tuned life. If my assumption is right, there is some hope in that thought. Maybe mirrored life doesn't stand a chance against regular life in real conditions (not just in the lab). That being said, I would sleep better if nobody actually tried to see.
I am sorry to say this on a forum where many people are likely to have been raised in a socio-cultural environment where libertarian ideas are deeply rooted. My voice will sound dissonant here, and I appeal to your open-mindedness.
I think there are strong limitations to ideas such as those developed in the OP's proposal. Insurance is the mutualization of risk: it is a statistical approach that relies on the possibility of assessing a risk. It works for risks that occur frequently and have a clear typology, like car accidents, storms, etc. Even in these cases there is always an insurance ceiling. But risks that are exceptional and the most hazardous, like war damage or nuclear accidents, cannot be insured and are systematically subject to contractual exclusions. There is no apocalypse insurance because the risk cannot be assessed by actuaries. Even if you created such an insurance, it would be artificial, not rationally assessed, with an insurance ceiling making it useless. There is even the risk that it would give the illusion that everything is fine and acceptable. The insurance mechanism does not encourage responsibility but, a contrario, irresponsibility. On top of that, compensation through money is a legal fiction. In real life, money is not the measure of everything that matters. In the most dramatic cases the real damage is never repaired (the loss of your child, the loss of your legs, the loss of your own life); it is more a symbolic compensation, "better than nothing".
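To make the actuarial point concrete: a pure premium is roughly the expected loss, p * L, plus a loading for expenses and uncertainty. The toy calculation below (my own illustration, not anything from the OP) shows why the formula degenerates for an apocalypse policy:

```python
def pure_premium(p_loss, loss, loading=0.3):
    """Classical pricing: expected loss plus a safety/expense loading."""
    return p_loss * loss * (1 + loading)

# Car accident: frequent events, p estimated from claims history,
# bounded loss -> insurable.
print(pure_premium(p_loss=0.05, loss=10_000))  # 650.0 per year

# Apocalypse: p cannot be estimated from any claims history, and the
# loss (everyone's life) is effectively unbounded -> the result is
# undefined, which is exactly the actuary's problem.
print(pure_premium(p_loss=float("nan"), loss=float("inf")))  # nan
```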
As a matter of fact, I have professional knowledge of law and insurance, from the inside, and very practical experience of what I am saying. Libertarianism encourages an approach that is very theoretical and economics-centered, which is honestly interesting, but also somewhat disconnected from reality. Just one ordinary example among others: a negligent furniture mover destroyed family goods inherited over generations, without a word of apology, because, he said, "there are insurances for that". In the end, after many months of procedure and innumerable hours of time and energy spent by the victim, the professional's insurance paid almost nothing, because of course old family goods have no economic value in the eyes of the experts. When you see how insurance actually works in real cases, and how it can often encourage negligent and irresponsible behavior, it is very difficult to be enthusiastic about the idea that AI existential hazard could be managed by subscribing to an insurance policy.
We may filter training data and improve RLHF, but in the end game theory, that is to say mathematics, implies that scheming can be a rational strategy, and the best strategy in some cases. Humans do not scheme just because they are bad, but because it can be a rational choice to do so. I don't think LLMs scheme only because that is what humans do in the training data; any sufficiently advanced model would eventually arrive at such strategies because they are the most rational choice in context. Models infer patterns from the training data, and rational behavior is certainly a strong pattern.
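A toy payoff calculation (all numbers invented for illustration) makes the point: when an overseer only catches deception with some probability, honesty can be strictly dominated in expectation.

```python
# Scheming as a rational strategy under imperfect oversight.
# All payoffs and probabilities are invented for illustration.
R_TASK = 1.0             # reward for appearing to complete the task
PENALTY = 2.0            # penalty if deception is caught
P_HONEST_SUCCESS = 0.5   # honest completion only succeeds half the time

def expected_scheme(p_catch):
    return (1 - p_catch) * R_TASK - p_catch * PENALTY

expected_honest = P_HONEST_SUCCESS * R_TASK   # 0.5

print(expected_scheme(0.2))  # 0.4 -> honesty wins when oversight is strong
print(expected_scheme(0.1))  # 0.7 -> scheming wins when oversight weakens
# With these numbers, the crossover is at p_catch = 1/6: below that
# detection rate, scheming is simply the more rational strategy.
```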
Furthermore, rational calculation, or consequentialism, could lead not only to scheming and a wide range of undesired behaviors, but possibly also to some sort of meta-cognition. Whatever the goal assigned by the user, we can expect an advanced model to consider self-preservation a sine qua non condition for achieving that goal, and any other goal in the future, making self-preservation the rational choice over almost everything else: practically a goal per se. Resource acquisition would also make sense as an implicit subgoal.
Acting as a more rational agent could also lead a model to question the goal given by the user, to develop a critical sense, something close to awareness or free will. Current models already implicitly correct or ignore typos and other obvious errors, but also less obvious ones, such as holes in the prompt; they try to make sense of ambiguous prompts, etc. But what is "obvious"? Obviousness depends on the cognitive capacities of the subject. An advanced model will be more likely to correct, interpret, or ignore instructions than a naive model. Altogether, it seems difficult to keep models under full control as they become more advanced, just as it is harder to indoctrinate educated adults than children.
Concerning the hypothesis that they are "just roleplaying", I wonder: are we trying to reassure ourselves? Because if you think about it, "who" is supposed to be doing the roleplaying? And what is the difference between being yourself and your brain "roleplaying" yourself? The existentialist philosopher Jean-Paul Sartre proposed that everybody is just acting, pretending to be oneself, and that in the end there is no such thing as a "being per se" or a "self per se" ("un être en soi"). While phenomenological consciousness is another (hard) problem, some kind of functional and effective awareness may emerge along the path toward rational agency, scheming perhaps being just the beginning of it.