David, I really like that your description discusses the multiple interacting levels involved in self-driving cars (eg. software, hardware, road rules, laws, etc.). Actual safety requires reasoning about those interacting levels and ensuring that the system as a whole doesn't have vulnerabilities. Malicious humans and AIs are likely to try to exploit systems in unfamiliar ways. For example, here are 10 potential harmful actions (out of many many more possibilities!) that AI-enabled malicious humans or AIs could benefit from by controlling self-driving cars in harmful ways:
1) Murder for hire: Cause accidents that kill a car's occupant, another car's occupant, or pedestrians
2) Terror for hire: Cause accidents that injure groups of pedestrians
3) Terror for hire: Control vehicles to deliver explosives or disperse pathogens
4) Extortion: Demand payment to not destroy a car, harm the driver, or harm others
5) Steal delivery contents: Take over delivery vehicles to steal their contents
6) Steal car components: Take over cars to steal components like catalytic converters
7) Surreptitious delivery or taxis: Control vehicles to earn money for illegal deliveries or use in illegal activities
8) Insurance fraud for hire: Cause damage to property for insurance fraud
9) Societal distraction for hire: Cause multiple crashes to overwhelm the societal response
10) Blocked roads for hire: Control multiple vehicles to block roadways for social harm or extortion
To prevent these and many related harms, safety analyses must integrate analyses on multiple levels. We need formal models of the relevant dynamics of each level and proof of adherence to overall safety criteria. For societal trust in the actual safety of a self-driving vehicle, I believe manufacturers will need to provide machine-checkable "certificates" of these analyses.
Hear Hear! Beautifully written! Safety is a systems issue and if components at any level have vulnerabilities, they can spread to other parts of the system. As you write, we already have effective techniques for formal verification at many of the lower levels: software, OS, chips, etc. And, as you say, the AI level and social levels are more challenging. We don't yet have a detailed understanding of transformer-based controllers and methods like RLHF diminish but don't eliminate harmful behaviors. But if we can choose precise coarse safety criteria, then formal gatekeepers between the AI and the actuators can ensure those constraints are obeyed.
At the social level, you mention legality and violation of laws. Those work well for humans because humans care about being fined or jailed. I think the AI situation is more challenging because it may be unclear what the source of an AI involved in a harmful event was. If a malicious human actor or AI can spin off harmful AI agents anonymously, then after-the-fact punishment doesn't serve as a deterrent. That suggests that we need to both create accountability for the provenance of AIs and to harden infrastructure so that even malicious AIs can't cause harmful outcomes. I believe that both of those can be accomplished using formally verified techniques like Max's and my "provable contracts".
But there is clearly a lot of work to be done! What are the next steps and how long do we have? I've been waiting for LLMs to reach proficiency at theorem proving, verified software synthesis, and autoformalization. Companies like DeepSeek (https://github.com/deepseek-ai/DeepSeek-Prover-V1.5 ), Harmonic (https://harmonic.fun/news ) and xAI (https://x.ai/blog ) are making rapid progress on each of these capabilities. But OpenAI's release of o1 this week may be the enabling event I've been waiting for. For example, check out mathematician Terence Tao's experience with it: https://mathstodon.xyz/@tao/113132502735585408 "I could imagine a model of this capability that was specifically finetuned on Lean and Mathlib, and integrated into an IDE, being extremely useful in formalization projects."
Of course, those new abilities also enable new capability levels for cyberattack and social manipulation. So the need to harden today's infrastructure may arrive earlier than many of us expected. That suggests that we may need to rapidly institute coarse protective measures before we can build out the truly human-beneficial infrastructure we ultimately want. But the sooner we can build precise models of our systems, their failure modes, and the full range of adversarial attacks, the less harm humanity will have to suffer.
People seem to be getting the implication we intended backwards. We're certainly not saying "For any random safety property you might want, you can use formal methods to magically find the rules and guarantee them!" What we are saying is "If you have a system which actually guarantees a safety property, then there is formal proof of that and there are many many benefits to making that explicit, if it is practical." We're not proposing any new physics, any new mathematics, or even any new AI capabilities other than further development of current capabilities in theorem proving, verified software synthesis, and autoformalization.
Humanity is in a crisis right now! Even without AI we have disasters like the $5 billion CrowdStrike flaw a few weeks ago, many many cyberattacks disabling critical systems, Boeing airplanes falling apart in the sky, etc. As open source AI cyberattack models advance, every one of today's flaws and security holes is likely to be exploited by a wide variety of malicious actors. We have clear descriptions of desired safety properties for many of today's issues and we have design rules which have been developed to mitigate the problems. But today's software development and engineering practices aren't leading to the safety we need! Software engineers wing it and introduce flaws in their systems and security holes are left unfixed for decades. Boeing outsources the manufacture of jet components to underpaid and unsupervised third parties.
By creating explicit formal representations of these rules, and formal "digital twins" of our systems, we can obtain guarantees that they are being followed. Our papers and my talks describe many techniques for doing this. But we are just scratching the surface. We need many many more people thinking about how to achieve actual safety in the face of today's and tomorrow's AIs.
I'm a little bit shocked at the low level of curiosity and creativity I'm seeing in these discussions. I have not seen any other proposals for actually dealing with the actual problems that real AI systems are likely to cause in the next few years. Attempts at alignment, red teaming, etc. are all very interesting and valuable but do nothing to address the hundreds of millions of open source AIs which are almost as powerful as the frontier closed models and which are easily fine tuned for cyberattack and other harmful actions.
The general approach we have outlined is not optional! Gödel's completeness theorem tells us that there is a formal proof of any safety property which actually holds for a system. Complex systems are very unlikely to be accidentally safe. So designing and building them with explicit formal models will be necessary. If humans don't do it, then powerful AIs certainly will. But, in that case, we may have very little input into exactly what properties will be guaranteed.
Our proposal is not simple! But it doesn't rely on any exotic new ideas. It does require creativity and dedication to actually work to ensure that humanity survives the next decade. There would be tremendous benefit in simply precisely encoding today's safety criteria and in creating systems that guarantee that those criteria are followed. With some future creativity we can go further and use these techniques to develop much more accurate safety criteria and better designs. I would have expected huge interest and forward movement on this. I'm sad to not see much of that so far.
Thinking more about the question of whether there are properties which we believe but for which we have no proof. And what do we do about those today and in an intended provable future?
I know of a few examples, especially in cryptography. One of the great successes of theoretical cryptography was the reduction of the security of a whole bunch of cryptographic constructs to a single assumption: the existence of one-way functions which are cheap to compute but expensive to invert: https://en.wikipedia.org/wiki/One-way_function That has been a great unifying discovery, and the way the cryptographers deal with it is that they just add the existence of one-way functions as an extra axiom to their formal system! It does mean that if it turns out not to be true, then a lot of their provably secure systems may actually not be. Fortunately, there is "Information-Theoretic Cryptography" https://www.cambridge.org/us/universitypress/subjects/engineering/communications-and-signal-processing/information-theoretic-cryptography?format=HB which doesn't rely on any unproven assumptions but is somewhat more inconvenient to use.
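For reference, here is the standard definition sketched (my notation, not taken from the Wikipedia article): f is one-way if it is easy to compute but every efficient adversary fails to invert it except with negligible probability.

```latex
% One-way function f : {0,1}^* -> {0,1}^* (standard definition, sketched):
%   (1) f(x) is computable in time poly(|x|), and
%   (2) for every probabilistic polynomial-time adversary A and all large n,
\Pr_{x \leftarrow \{0,1\}^n}\!\left[\, f\!\left(A\!\left(1^n, f(x)\right)\right) = f(x) \,\right] \;\le\; \operatorname{negl}(n).
```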
Then there's "Public Key Cryptography" which rests on much dicier assumptions (such as factoring being hard). We already know that a bunch of those assumptions no longer hold under quantum computation. NIST recently announced 3 "post-quantum" standards https://csrc.nist.gov/projects/post-quantum-cryptography but I think there is still quite a lot of worry about them.
More generally, mathematicians often have statements which they believe to be true (eg. P!=NP, the Riemann hypothesis, and others: https://en.wikipedia.org/wiki/List_of_unsolved_problems_in_mathematics ). On what evidence do mathematicians believe these statements? What happens if they use them in proofs of other statements?
Timothy Gowers wrote an insightful essay into these questions: "What Makes Mathematicians Believe Unproved Mathematical Statements?" https://www.semanticscholar.org/paper/What-Makes-Mathematicians-Believe-Unproved-Gowers/b17901cece820de845e57456eda06f892b5ba199
What does this mean for provable safety? One can always add any beliefs one likes to one's axioms and prove things from there! If the axioms you have added turn out to be false, that can undo the guarantees about anything which depended on them. That suggests that one should try to limit unproven assumptions as much as possible! And it also seems to be a great strategy to follow the cryptographers and try to capture the essence of an assumption in a "standard" assumption whose validity can be tested in a variety of contexts.
Physics aims to have a precise fundamental theory and to mathematically derive the consequences of that theory to explain all phenomena. From a mathematical point of view, physicists are notoriously "mathematically sloppy", using techniques which may often give the right answer but which may not be provable (eg. different perturbation methods, renormalization, path integrals, etc.). But, fortunately, more mathematically inclined physicists have repeatedly come along afterwards and created precise formal models within which the physics derivations are justified.
Two big leaps in the tower of physics models are from deterministic quantum equations (eg. Schrödinger's equation) to statistical representations of measurements and from particle descriptions of matter (eg. an ideal gas) to statistical mechanics representations based on probabilities. Huge literatures explore the meaning and character of those two leaps, but my sense is that in general we can't yet formally justify them, though they are extremely well justified empirically. Physicists call these the assumptions of "Quantum decoherence" https://en.wikipedia.org/wiki/Quantum_decoherence and "Stosszahlansatz" or "Molecular chaos" https://en.wikipedia.org/wiki/Molecular_chaos
How do we deal with them formally? I think we have to do what physicists do and just add those assumptions as axioms to the formal models. This is risky in an adversarial context. We have to watch out for powerful adversaries (ie. powerful AIs) which can control matter at a level which enables them to violate these assumptions. Doesn't seem likely to me, but we must be ever vigilant!
Something I would like to do in these situations, but I don't think we have the philosophical underpinnings for, is to have precise provable estimates of the probabilities of these assumptions being true or false. Gowers makes some attempt at that but I'm not sure it's formal yet. It's a bit weird: it would be a probability for something which is objectively either true or false. So it would be a measure of our knowledge of the state. But it would be valuable for AI safety to have a justifiable measure for how much we need to worry about an adversary being able to violate our assumptions. And, ultimately, our current laws of physics are of this character. It would be great to have a precise measure of our confidence in various physical properties like symmetries (eg. time-invariance, translation invariance, rotational invariance, etc.), conservation laws (mass/energy, momentum, lepton-number, etc.), etc.
Yes, thanks Steve! Very interesting examples! As I understand it, most chip verification is based on SAT solvers and "Model Checking" https://en.wikipedia.org/wiki/Model_checking . This is a particular proof search technique which can often provide full coverage for circuits. But it has no access to any kind of sophisticated theorems such as those in the Lean mathematics library. For small circuits, that kind of technique is often fine. But as circuits get more complex, it is subject to the "state explosion problem".
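To illustrate what I mean (a toy sketch, not how VC Formal or any real tool works internally): explicit-state model checking just enumerates every reachable state and checks the property in each one, and each extra state bit doubles the space, which is exactly the state explosion problem.

```python
# Minimal explicit-state model checking sketch (toy example, not a real tool).
# We enumerate all states reachable from the initial state of a tiny
# 3-bit counter circuit and check an invariant in each one.
from collections import deque

NUM_BITS = 3  # each added state bit doubles the state space: the "state explosion problem"

def step(state: int) -> int:
    """Transition relation: the circuit increments a 3-bit counter, wrapping at 6."""
    return (state + 1) % 6

def invariant(state: int) -> bool:
    """Safety property: the counter never reaches the unused encodings 6 or 7."""
    return state < 6

def check_model(initial: int) -> bool:
    seen, frontier = {initial}, deque([initial])
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            print(f"Counterexample: reachable state {s} violates the invariant")
            return False
        nxt = step(s)
        if nxt not in seen:
            seen.add(nxt)
            frontier.append(nxt)
    print(f"Invariant holds in all {len(seen)} reachable states (of {2**NUM_BITS} possible)")
    return True

check_model(initial=0)
```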
Looking at the floating point division paper, the double precision divider took 7 hours and 30 minutes indicating a lot of search! But one great thing about proofs is that their validity doesn't depend on how long they take to find or on how big they are.
It looks like they did this verification with the Synopsys VC Formal tools (https://www.synopsys.com/verification/static-and-formal-verification/vc-formal.html ). This looks like a nice toolkit, but it certainly isn't higher-level mathematical proof. It sounds like it's perfectly adequate for this task. But I wouldn't expect it to extend to the whole system very smoothly.
To see what should be possible as AI theorem provers come on line, ask how the Synopsys engineers designed the whole chip to begin with. Presumably they had arguments in their heads about why their design was correct. Humans are especially bad at complex mathematical components like floating point divide, so it makes great sense to use a kind of brute force tool to back up their intuitions there. But with general reasoning capabilities (eg. as people are doing in Lean, Coq, etc.) it shouldn't be hard to formalize the engineer's intuitive understanding of why the whole chip is correct.
If the engineers already have a mental proof of correctness, what is the benefit of a formal proof? I believe one of the most important aspects of proof is its social role. The proof in the engineer's head is not convincing to anyone else. They can try to explain it to another engineer. But that engineer may or may not share the same background and intuitions. And if they try to explain it to their manager, the poor manager mostly just has to go on his intuitive assessment of the engineer and on their reputation.
A formal proof eliminates all that need for trust! With a formal proof, anyone can rapidly check it with 100% reliability. The proven design can be incorporated along with other proven designs and the proofs composed to get a proven design of bigger systems. The user of the chip can check the proof and not need to trust the design company. All of these factors make proofs extremely valuable as part of the social interactions of design and deployment. They would be worth creating independent of the increase in correctness and security.
These papers are a great example of why AI theorem proving is likely to be a complete game changer for safety and security but also for general design and process management. NVIDIA's H100 chip has nearly 13,000 AI-designed circuits https://developer.nvidia.com/blog/designing-arithmetic-circuits-with-deep-reinforcement-learning/ That's just the tip of the iceberg when AI-driven verification becomes the norm.
I agree that would be better than what we usually have now! And it is more in the "Swiss Cheese" approach to security. From a practical perspective, we are probably going to have to do that for some time: components with provable properties combined in unproven ways. But every aspect which is unproven is a potential vulnerability.
The deeper question is whether there are situations where it has to be that way. Where there is some barrier to modeling the entire system and formally combining correctness and security properties of components to obtain them for the whole system.
Certainly there are hardware and software components whose detailed behavior is computationally complex to predict in advance (eg. searching for solutions to SAT problems or inverting hash functions). So you are unlikely to be able to prove theorems like "For every specification of these n bits in this SAT problem, it will take f(n) time to discover a satisfying value for the remaining bits". But that's fine! It's just that you shouldn't make the correctness or security of your system depend on that! For example, you might put a time bound on the search and have a failsafe path if it doesn't succeed by then. That software does have a provable time bound.
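Here's a minimal sketch of that pattern (toy code, invented names): the inner search has no useful worst-case bound, but the wrapper is the thing we can prove a time bound about, because it always falls back to a failsafe path when the budget runs out.

```python
# Sketch of "time-bounded search with a failsafe path" (toy example).
# The inner search may be exponentially slow; the wrapper is what we
# prove a time bound about, not the search itself.
import itertools, time

def brute_force_sat(clauses, num_vars, deadline):
    """Try all assignments; give up (return None) once the deadline passes."""
    for bits in itertools.product([False, True], repeat=num_vars):
        if time.monotonic() > deadline:
            return None  # timed out: caller must take the failsafe path
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in clauses):
            return bits  # satisfying assignment found
    return False  # exhaustively shown unsatisfiable

def solve_with_failsafe(clauses, num_vars, budget_seconds=1.0):
    deadline = time.monotonic() + budget_seconds
    result = brute_force_sat(clauses, num_vars, deadline)
    if result is None:
        return "FAILSAFE"   # the provable claim is about this wrapper's response time
    return result

# (x1 or x2) and (not x1 or x3)
print(solve_with_failsafe([[1, 2], [-1, 3]], num_vars=3))
```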
So, in general, systems need to be designed to be correct and safe. If you can't put provable bounds on the safety of a system, then I would argue that you have no business exposing the public to that system.
It would be great to start collecting examples of subcomponents or compositional designs which are especially difficult to prove properties about. My sense is that virtually all of these formal analyses will be done by AIs and not by humans. And I think it will be important to develop libraries and models of problems which are easy to solve formally and provide formal guarantees about. And those which are more difficult.
Thinking about the software verification case, I would argue that for every decently written piece of software today, the programmer has an internal argument in their head as to why it is correct and not vulnerable to attacks. Humans are fallible, so their argument may not be correct. But if it is correct, then it shouldn't be difficult to formalize it into a precise formal proof. The "de Bruijn Factor" (https://www.cs.ru.nl/~freek/factor/factor.pdf ) measures how much bigger a formal proof of something is than an informal description. It seems to be between 4 and 10 in current formal systems. So, if a human programmer has confidence in the correctness and security of his code, it should only be a small factor more work for an AI to formally prove that. If the programmer doesn't have that confidence, then I think we have no business deploying it anywhere it might harm humans.
I totally agree in today's world! Today, we have management protocols which are aimed at requiring testing and record keeping to ensure that boats and ships are in the state we would like them to be in. But these rules are subject to corruption and malfeasance (such as the 420 Boeing jets which incorporated defective parts and yet which are currently flying with passengers: https://doctorow.medium.com/https-pluralistic-net-2024-05-01-boeing-boeing-mrsa-2d9ba398bd54 )
But it appears we are rapidly moving to a world in which much of the physical labor will be done by robots and in which each physical system will have a corresponding "digital twin" (eg. https://www.nvidia.com/en-us/omniverse/solutions/digital-twins/ ).
In that world, we can implement provable formal rules governing every system, from raw materials, to manufacture, to supply chain, to operations, and to maintenance.
In an AI world, much more sophisticated malfeasance can occur. Formal models of domains, with proofs of adherence to rules and protection against adversaries, are the only way to ensure our systems are safe and effective.
Testing is great for a first pass! And in non-critical and non-adversarial settings, testing can give you actual probabilistic bounds. If the probability distribution of the actual uses is the same as the testing distribution (or close enough to it), then the test statistics can be used to bound the probability of errors during use. I think that is why formal methods are so rarely used in software: testing is pretty good and if errors show up, you can fix them then. Hardware has greater adoption of formal methods because it's much more expensive to fix errors after the fact.
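To be concrete about the kind of bound testing can give (standard statistics, and it assumes the test cases really are drawn i.i.d. from the deployment distribution): if n independent tests all pass, the "rule of three" bounds the per-use failure probability p at roughly 95% confidence.

```latex
% If n i.i.d. tests all pass, then at ~95% confidence p <= 3/n, since for p > 3/n:
\Pr[\text{zero failures in } n \text{ tests}] \;=\; (1-p)^n \;<\; e^{-np} \;<\; e^{-3} \;\approx\; 0.05 .
```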
But the real problems arise from adversarial attacks. The statistical correctness of a system doesn't matter to an adversary. They are looking for the weird outlier cases which will enable them to exploit the system (eg. inputs with non-standard characters that break the parser, or super-long inputs which overflow a buffer and enable unexpected access to memory, etc.). Testing can't show the absence of flaws (unless every input is tested!).
I think the increasing plague of cyberattacks is due to adversaries becoming more sophisticated in their search for non-standard ways of interacting with systems that expose their untested and unproven underbelly. But that kind of sophisticated attack requires highly skilled attackers and those are fortunately still rare.
What is coming, however, are AI-powered cyberattack systems which know all of the standard flaws and vulnerabilities of systems, all of the published 1-day vulnerabilities, all of the latest social engineering techniques discussed on the dark web, and have full access to reverse engineering tools like Ghidra. Those AIs are likely being developed as we speak in various government labs (eg. here is a list of significant recent cyber incidents: https://www.csis.org/programs/strategic-technologies-program/significant-cyber-incidents ).
How long before powerful cyberattack AIs are available on bittorrent to teenage hackers? So, I believe the new reality is that every system, software and hardware, needs to be proven correct and secure for us to have any confidence in it. To do that, we are likely to need to use AI theorem provers and AI-verified software synthesis systems. Fortunately, many groups are showing rapid progress on those!
But that doesn't mean testing is useless. It's very helpful during the development process and in better understanding systems. For final deployment in an environment with powerful AIs, however, I don't think it's adequate any more.
In general, we can't prevent physical failures. What we can do is to accurately bound the probability of them occurring, to create designs which limit the damage that they cause, and to limit the ability of adversarial attacks to trigger and exploit them. We're advocating for humanity's entire infrastructure to be upgraded with provable technology to put guaranteed bounds on failures at every level and to eliminate the need to trust potentially flawed or corrupt actors.
In the case of the ship, there are both questions about the design of that ship's components and its provenance. Why did the backup power not enable the propulsion system to stop? Why wasn't there a "failsafe" anchor which drops if the systems become inoperable? Why didn't the port have tugboats guiding risky ship departures? What was the history of that ship's generators? Etc. With the kind of provable technology that Max and I outlined, it is possible to have provably trustable data about the components of the ship, about their manufacture, about their provenance since manufacture, about the maintenance history of the ship's components, etc.
The author of the main post and other critics argue against formal methods doing complex "magical" things like determining which DNA sequences are safe, how autonomous vehicles should navigate cities, or detecting bad thoughts in huge transformer neural nets. Someday these methods might help with some of those, but those aren't the low hanging fruit we are proposing. In some sense we mainly want to use proof for much more mundane things. What Max and I are arguing for are mechanisms to create software, hardware, and social designs which aren't exploitable by adversarial AIs and to create infrastructure that provides guarantees about its state and behavior. Nothing we are proposing requires sophisticated mathematics that today's grad students couldn't do. Nothing requires new physics or radically new engineering principles. Rather, it is a way to organize current technologies to increase trust and eliminate vulnerabilities.
These technologies enable us to eliminate the need to trust third parties: Was a computation performed accurately? Were there bugs in the program? What data was used to train this model or estimate this probability? What probabilistic program or neural net was used? Was the training done correctly? What is the history of this component? What evidence is there that it was manufactured correctly? These and thousands more cases will enable us to build up a robust infrastructure which is provably not vulnerable to AI-driven attack.
A core aspect of this is that we can use untrusted powerful AIs running on untrusted datacenters in untrusted countries to help us build completely trusted software, hardware, and social protocols. The idea is to precisely specify a task (eg. software spec, hardware spec, solve a mathematically encoded problem, etc.) and have the untrusted AI generate both an answer and a proof (in a system like Lean) that the answer solves the precisely specified problem or design task. We can cheaply and completely reliably check the proof. If it verifies, then we can fully trust the results from the untrusted AI. This enables us to bootstrap the current mess of untrusted and unreliable AIs, flaky and insecure hardware, untrustable people and groups, etc. to build up a *fully* trustable infrastructure. The power and importance of this is immense!
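Here is the pattern in miniature (a toy stand-in: factoring plays the role of the hard task and multiplication plays the role of the cheap proof check; a real version would check a Lean proof object against a formal spec).

```python
# Toy "untrusted generator / trusted cheap checker" pattern.
# The untrusted party does the hard work (here: factoring); the trusted
# checker does only cheap verification (here: one multiplication and two
# range checks).  We trust the verified RESULT, not the generator.

def untrusted_factor(n: int):
    """Stand-in for an untrusted AI: returns a claimed nontrivial factorization."""
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return d, n // d
    return None

def trusted_check(n: int, claim) -> bool:
    """Cheap, fully trusted verification of the claim."""
    if claim is None:
        return False
    p, q = claim
    return 1 < p < n and 1 < q < n and p * q == n

n = 8_633  # = 89 * 97
claim = untrusted_factor(n)          # possibly adversarial / unreliable
if trusted_check(n, claim):          # cheap check that we fully control
    print(f"Accepted: {n} = {claim[0]} * {claim[1]}")
else:
    print("Rejected: no trust without a verified certificate")
```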
I totally agree! I think this technology is likely to be the foundation of many future capabilities as well as safety. What I meant was that society is unlikely to replace today's insecure and unreliable power grid controllers, train network controllers, satellite networks, phone system, voting machines, etc. until some big event forces that. And that if the community produces comprehensive provable safety design principles, those are more likely to get implemented at that point.
Oh, I should have been clearer. In the first part, I was responding to his "rough approximation..[valid] over short periods of time." claim for formal methods. I was arguing that we can at least copy current methods and in current situations get bridges which actually work rather robustly and for a long time.
And, already, the formal version is better in many respects. For one, we can be sure it is being followed! So we get "correctness" in that domain. A second piece, I think, is very important, but I haven't figured out how to communicate it well yet. And that's that the proof that a design satisfies design rules is a specific artifact that anyone can check. So it completely changes the social structure of the situation. Instead of having to rely on the "expert", who may or may not be competent, and may or may not be corrupt, each party is empowered with an absolute guarantee of correctness. I think this alters many social processes dramatically, but it needs to be fleshed out and better explained.
After that argument, I go to the next piece which you mention. Today's engineering practices are not likely to be robust against powerful adversaries (eg. powerful AI or humans backed up by powerful AI). And I don't think current practices can deal with that very well. In the AI safety space, the typical approach is "red teaming" where humans try to trigger AIs to produce bad scenarios and they see how easy it is and how powerful the attacks are. This can find problems but can't show the absence of safety vulnerabilities.
With mathematical proof, we can systematically consider the entire space of possible actions by adversaries in the specified class. Using techniques like "branch and bound", we can systematically eliminate regions of the action space which are shown to be safe. And if the system is actually safe, there is a proof of that (by Gödel's completeness theorem, which says that any property which holds in all models of a set of axioms can be proven from those axioms). If the systems are complex, the proofs can be large, so there is value in "designing for verification".
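As a toy illustration of eliminating regions of the action space (the "safety margin" function, the threshold, and the action range are all invented for the example; a real version would use rigorous interval arithmetic and emit a checkable certificate):

```python
# Toy branch-and-bound argument that a "safety margin" stays above a threshold
# for EVERY adversary action a in [0, 4].

THRESHOLD = 0.5

def margin_lower_bound(lo: float, hi: float) -> float:
    """Sound lower bound of m(a) = a^2 - 2a + 2 over [lo, hi]
    via simple interval arithmetic (assumes 0 <= lo <= hi)."""
    return lo * lo - 2.0 * hi + 2.0

def prove_safe(lo: float, hi: float, depth: int = 0) -> bool:
    if margin_lower_bound(lo, hi) >= THRESHOLD:
        return True                      # this whole region is eliminated as safe
    if depth > 40:                       # give up: inconclusive, NOT a proof of safety
        return False
    mid = (lo + hi) / 2.0                # branch: split the region and bound each half
    return prove_safe(lo, mid, depth + 1) and prove_safe(mid, hi, depth + 1)

print("Safe for all a in [0, 4]:", prove_safe(0.0, 4.0))
```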
That's a possibility which provides actual safety against AI and other adversaries and provides detailed information about the value of different features, etc. Several groups are working right now to develop examples of this kind. Hopefully, the process can eventually be automated so that we can put it on the same timescale as AI advancement.
Challenge 4: AI advances, including AGI, are not likely to be disruptively helpful for improving formal verification-based models until it’s too late.
Yes, this is our biggest challenge, I think. Right now very few people have experience with formal systems and historically the approach has been met with continuous misunderstanding and outright hostility.
In 1949 Alan Turing laid out the foundations for provably correct software: "An Early Program Proof by Alan Turing". What should have happened (in my book) is that computer science was recognized as a mathematical science capable of mathematically proving the correctness of its designs. And through the years, there have indeed been many who were inspired by that vision and made great and important contributions.
But, unfortunately, the field was resistant to correctness and we got wave after wave of "sloppy programming" which continues to haunt us to this day. For example, the July 19, 2024 CrowdStrike incident has been called the largest IT outage in history, causing over $10 billion in financial damages. It was caused by a sloppy error in an update to Microsoft Windows security software. This is an outrage and it has been an ongoing outrage almost since the time of Turing.
In another post, I mentioned a similar outrage in scientific computing. It has been known for many decades how to mathematically precisely perform scientific computing. And yet the standard remains "slap the code up against the wall and see if the outputs look reasonable". It is unknown how many scientific results, medical analyses, or engineering designs are flawed because people couldn't be bothered to perform their computations correctly.
Cryptography has similar issues. The mainstream is based on computational cryptography which has unknown vulnerability to powerful AI attack. Meanwhile, provably correct "Information Theoretic Cryptography" languishes in a few academic conferences with very little use in practice.
So humanity is way behind where we should be on the formalization front. Our ancestors' sloppiness may now be our downfall. One hope, as you mention, is the rapid advancement of AI theorem proving, AI autoformalization, AI verified program synthesis, and other related areas.
It looks to me that we are now in a race of AI for safety vs. AI for unsafe capabilities. The more people who are aware of the issues and the potential solutions using the new formal technologies, the greater the chance we have to survive.
I think the safest world would be one where humanity does not create AI. Next would be where we have a long enough pause in AI development to carefully create safe infrastructure. Next would be restricting powerful AI to a few highly regulated government labs. Next would be a world with tight controls on GPUs and datacenters with no computational overhang. Unfortunately, it appears that all those ships have already sailed. We now have huge computational overhang and powerful open source AI models. The Llama 3.1 8B model can run on a $90 Raspberry Pi 5. So we have to rebuild the world with safe infrastructure in the face of rapidly improving and uncontrolled AI capabilities.
It appears that the massive needed change will only happen *after* large scale AI-powered attacks and destruction begins. I think the greatest contribution to humanity's survival right now is to create detailed plans for building provably safe infrastructure, so that when the enabling technologies appear and the world begins demanding safe technology, there is a plan for moving forward.
The post's second Challenge is:
Challenge 2 – Most of the AI threats of greatest concern have too much complexity to physically model.
Setting aside for a moment the question of whether we can develop precise rules-based models of physics, GS-based approaches to safety would still need to determine how to formally model the specific AI threats of interest as well. For example, consider the problem of determining whether a given RNA or DNA sequence could cause harm to individuals or to the human species. This is a well-known area of concern in synthetic biology, where experts expect that risks, especially around the synthesis of novel viruses, will dramatically increase as more end-users gain access to powerful AI systems. This threat is specifically discussed in [3] as an area in which the authors believe that formal verification-based approaches can help:
I certainly agree that there are complex situations where we can't currently tell what's safe and what isn't. In that case we should clearly be incredibly conservative and not expose humans to systems with unclear safety properties! The burden is on the creator of a system to demonstrate safety!
Regarding DNA and RNA synthesis, I see the need for provable technologies at a much simpler and lower level than what you are talking about. Right now you can buy DNA and RNA synthesis machines from many suppliers for just a few thousand dollars, without any kind of control on what they will synthesize. In a world in which the complete DNA sequences of smallpox and other extremely harmful pathogens are known, this is completely insane!
It would be nice if we could automatically tell whether a DNA sequence was harmful or not. But, as you say, we are not at the point where AI can do that. But that doesn't mean we throw up our hands and say "Ok, go ahead and synthesize whatever you want! Here's the smallpox DNA in case you're interested!"
Our biohazard labs have strict rules and guidelines for what they can synthesize and how. Human committees must sign off on work that might inadvertently lead to human harm. We need at least that level of control on synthesis devices available in the open market. For example, sequences might need to be cryptographically signed by a governing body stating they are safe before they can be synthesized.
But how can we be sure that the hardware will require those digital signatures? That's where the "provable contract" hardened cryptographic technologies that Max and I describe come in. It needs to be impossible for adversaries to use synthesis machines to create unsigned potentially harmful sequences.
There is a whole large story about how that works. Fundamentally it involves secure digital hardware similar to Apple's Secure Enclave. But with provably effective tamper sensing which implements "zeroization" (deletion of cryptographic keys on detection of tampering). This needs to be integrated into the core operation of the device.
Similar technology needs to be incorporated into robots and other physical devices which can cause harm. The mathematical proof guarantee is that the device will only operate under conditions specified by formal "contract rules". Those rules can be chosen to be whatever is appropriate for the domain of interest. For biohazard DNA synthesis, it might involve a digital signature from a secure government database of safe sequences or from a regulatory committee overseeing new research.
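As a minimal sketch of the signature gate itself (using the Ed25519 primitives from the Python "cryptography" package; the governing body, key management, and the tamper-sensing secure hardware that would actually hold the key are all assumed and not shown):

```python
# Sketch: a synthesizer routine that refuses any sequence not signed by the
# governing body's key.  In a real device this check (and the public key)
# would live inside tamper-sensing secure hardware, not ordinary software.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Governing body's keypair (generated here only for the demo).
authority_private = Ed25519PrivateKey.generate()
authority_public = authority_private.public_key()

def approve(sequence: bytes) -> bytes:
    """Done by the governing body after review: sign the approved sequence."""
    return authority_private.sign(sequence)

def synthesize(sequence: bytes, signature: bytes) -> None:
    """Device-side gate: only properly signed sequences are ever synthesized."""
    try:
        authority_public.verify(signature, sequence)
    except InvalidSignature:
        print("REFUSED: sequence is not approved")
        return
    print(f"Synthesizing {len(sequence)}-base sequence...")

seq = b"ATGCATGCATGC"
synthesize(seq, approve(seq))               # accepted
synthesize(b"TTTTOTHERSEQ", approve(seq))   # rejected: signature doesn't match
```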
As a concrete example, consider the recent Francis Scott Key Bridge collapse where an out of control container ship struck one of the piers of the bridge. It killed six people, blocked the Port of Baltimore for 11 weeks, and will cost $1.7 billion to replace the bridge which will take four years.
Could the bridge's designers way back in 1977 have anticipated that a bridge over one of the busiest shipping routes in the United States might some day be impacted by a ship? Could they have designed it to not collapse in this circumstance? Perhaps shore up the base of its piers to absorb the impact of the ship?
This was certainly a tragedy and is believed to have been an accident. But the cause was electrical problems on the ship. "At 1:24 a.m., the ship suffered a "complete blackout" and began to drift out of the shipping channel; a backup generator supported electrical systems but did not provide power to the propulsion system"
How well designed was that ship's generator and backup generator? Could adversarial attackers cause this kind of electrical blackout in other ships? Could remote AI systems do it? How many other bridges are vulnerable to impacts from out of control ships? What is the economic value at stake from this kind of flawed safety engineering? How much is it worth to create designs which provide guaranteed protections against this kind of flaw?
Here's the post's first proposed limitation:
Limitation 1 – We will not obtain strong proofs or formal guarantees about the behavior of AI systems in the physical world. At best we may obtain guarantees about rough approximations of such behavior, over short periods of time.
For many readers with real-world experience working in applied math, the above limitation may seem so obvious they may wonder whether it is worth stating at all. The reasons why it is are twofold. First, researchers advocating for GS methods appear to be specifically arguing for the likelihood of near-term solutions that could somehow overcome Limitation 1.
Think about this claimed limitation in the context of today's deployed technologies and infrastructure. There are 5 million bridges around the world that people use every day to cross rivers and chasms. Many of them have lasted for hundreds of years and new ones are built with confidence all the time. Failures do happen, but they are so rare that news stories are written about them and videos are shared on YouTube. How are these remarkably successful technologies built?
Mechanical engineers have built up a body of knowledge about building safe bridges. They have precise rules about the structure of successful bridges. They have detailed equations, simulations, and tables of material properties. And they know the regimes of validity of their models. When they build a bridge according to these rules, they have a confidence in its safety which is nothing like a "rough approximation..[valid] over short periods of time."
We are rapidly moving from human engineers to AI systems for design. What should an AI bridge design system look like? It should certainly encode all the rules that the human engineers use! And, to give confidence that its designs actually follow those rules, it would be really great to be able to check each design against them. We could hire a human mechanical engineer to audit the AI design, but we'd really like to automate that process.
Enter formal systems like Lean! We can encode the mechanical engineer's design criteria and particular bridge designs as precise Lean statements. Now the statement that a particular design follows the design criteria becomes a theorem and Lean can provide a proof certificate that it is true. This certificate can be automatically checked by a proof checker and users can have high confidence that the design does indeed follow the design rules without having to trust anyone!
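Here's a tiny, made-up illustration in Lean 4 (the design rule and the numbers are invented; real mechanical design rules would be far richer): the claim "this design follows the design rule" literally becomes a theorem, and the kernel-checked proof is the certificate.

```lean
-- Toy illustration (the rule and the numbers are invented): a design rule and
-- a particular design encoded in Lean 4.  "This design satisfies the rule" is
-- now a theorem, and the kernel-checked proof is the certificate.

structure BridgeDesign where
  spanMeters    : Nat
  ratedLoadTons : Nat
  pierCount     : Nat

-- Made-up design rule: at least 10 tons of rated load per meter of span,
-- and at least two piers.
def meetsDesignRule (d : BridgeDesign) : Prop :=
  10 * d.spanMeters ≤ d.ratedLoadTons ∧ 2 ≤ d.pierCount

def proposedDesign : BridgeDesign :=
  { spanMeters := 300, ratedLoadTons := 4000, pierCount := 4 }

theorem proposedDesign_ok : meetsDesignRule proposedDesign :=
  And.intro (by decide) (by decide)
```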
This formalization process is happening right now in hundreds of different disciplines. It is still new, so early attempts may be incomplete and require engineers to learn to use new tools. But AI is helping here as well. The field of "autoformalization" takes textbooks, manuals, design documents, software, scientific articles, mathematics papers, etc. written in natural language and automatically generates precise formal models of them. Large language models are becoming quite good at this (eg. here's a recent paper: "Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization").
As AI autoformalization and theorem proving improves, we should expect the entire corpus of scientific and engineering knowledge to be rapidly represented in a precise formal form. And AI designs which at least meet the current human requirements for safe systems can be automatically generated with proof certificates of adherence to design rules given.
But we cannot stop at the current level of engineering safety. Today's technologies and infrastructure work pretty well in today's human environment. Unfortunately, as AIs become more powerful, they will likely be used in an adversarial way to attack infrastructure, both at the behest of malicious human actors and in the process of satisfying autonomous subgoals. To counter this threat, we need a much higher standard of safety than is common in current engineering practice.
Max and I argue that the only way to have assurance of safety against powerful adversarial agents is through formal methods and mathematical proof. For even moderately complex systems, we will need formal guarantees of safety against every attack path that is available to a specified class of adversaries. This is the level of safety that we need Provably Safe and Guaranteed Safe approaches to achieve.
An example of this kind of thing is the "Proton Radius Puzzle" https://physicsworld.com/a/solving-the-proton-puzzle/ https://en.wikipedia.org/wiki/Proton_radius_puzzle in which different measurements and theoretical calculations of the radius of the proton differed by about 4%. The physics world went wild and hundreds of articles were published about it! It seems to have been resolved now, though.
Our infrastructure should refuse to do anything the AI asks unless the AI itself provides a proof that it obeys the rules we have set. So we force the intelligent system itself to verify anything it generates!
Oh yeah, by their very nature it's likely to be hard to predict intelligent systems' behavior in detail. We can put constraints on them, though, and prove that they operate within those constraints.
Even simple systems like random SAT problems https://en.wikipedia.org/wiki/SAT_solver can have a very rich statistical structure. And the behavior of the solvers can be quite unpredictable.
In some sense, this is the source of unpredictability of cryptographic hash functions. Oded Goldreich proposed an unbelievably simple boolean function which is believed to be one-way: https://link.springer.com/chapter/10.1007/978-3-642-22670-0_10
On the other hand, I think it is often possible to distill behavior for a particular task from a rich intelligence into simple code with provable properties.
I think we definitely would like a potentially unsafe AI to be able to generate control actions, code, hardware, or systems designs together with proofs that those designs meet specified goals. Our trusted systems can then cheaply and reliably check that proof and if it passes, safely use the designs or actions from an untrusted AI. I think that's a hugely important pattern and it can be extended in all sorts of ways. For example, markets of untrusted agents can still solve problems and take actions that obey desired constraints, etc.
The issue of unaligned AGI hiding itself is potentially huge! I have end state designs that would guarantee peace and abundance for humanity, but they require that all AIs operate under a single proven infrastructure. In the intermediate period between now and then is the highest risk, I think.
And, of course, an adversarial AI will do everything it can to hide and garner resources! One of the great uses of provable hardware is the ability to create controlled privacy. You can have extensive networks of sensors where all parties are convinced by proofs that they won't transmit information about what they are sensing unless a specified situation is sensed. It looks like that kind of technology might allow mutual treaties which meet all parties' needs but prevent the "hidden rogue AIs" buried in the desert. I don't understand the dynamics very well yet, though.
While dark energy and dark matter have a big effect on the evolution of the universe as a whole, they don't interact in any measurable way with systems here on earth. Ethan Siegel has some great posts narrowing down their properties based on what we definitively know, eg. https://bigthink.com/starts-with-a-bang/dark-matter-bullet-cluster/ So it's important on large scales but not, say, on the scale of earth. Of course, if we consider the evolution of AI and humanity over much longer timescales, then we will likely need a detailed theory. That again shows that we need to work with precise models which may expand their regimes of applicability.
I'm focused on making sure our infrastructure is safe against AI attacks. This will require that our software not have security holes, that our cryptography not be vulnerable to mathematical attacks, our hardware not leak signals to adversarial sensors, and our social mechanisms not be vulnerable to manipulative interactions. I believe the only practical way to have these assurances is to model our systems formally using tools like Lean and to design them so that adversaries in a specified class provably cannot break specified safety criteria (eg. not leak cryptographic keys). Humans can do these tasks today but it is laborious and usually only attempted when the stakes are high. We are likely to soon have AI systems which can synthesize systems with verified proofs of safety properties rapidly and cheaply. I believe this will be a game changer for safety and am arguing that we need to prepare for that opportunity. One important piece of infrastructure is the hardware that AI runs on. When we have provable protections for that hardware, we can put hard controls on the compute available to AIs, its ability to replicate, etc..
Simulation of the time evolution of models from their dynamical equations is only one way of proving properties about them. For example, a harmonic oscillator https://en.wikipedia.org/wiki/Harmonic_oscillator has dynamical equations m d^2x/dt^2= -kx. You can simulate that but you can also prove that the kinetic plus potential energy is conserved and get limits on its behavior arbitrarily far into the future.
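Concretely (a standard calculation): define the total energy, differentiate it along solutions of the equation of motion, and you get a bound on the motion that holds for all future times.

```latex
% Total energy of the harmonic oscillator and its conservation along solutions:
E = \tfrac{1}{2} m \dot{x}^2 + \tfrac{1}{2} k x^2, \qquad
\frac{dE}{dt} = m \dot{x} \ddot{x} + k x \dot{x}
             = \dot{x}\,(m \ddot{x} + k x) = 0
\quad \text{(using } m \ddot{x} = -k x \text{)},
% so for every future time t:
|x(t)| \;\le\; \sqrt{2 E(0) / k}.
```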
It's very hard to get large gravitational fields. The closest known black hole to Earth is Gaia BH1 which is 1560 light-years away: https://www.space.com/closest-massive-black-hole-earth-hubble The strongest gravitational waves come from the collision of two black holes but by the time they reach Earth they are so weak it takes huge effort to measure them and they are in the weak curvature regime where standard quantum field theory is fine: https://www.ligo.caltech.edu/page/what-are-gw#:~:text=The%20strongest%20gravitational%20waves%20are,)%2C%20and%20colliding%20neutron%20stars.
It's also quite challenging to create high energy particles, they tend to rapidly collide and dissipate their energy. The CERN "Large Hadron Collider" is the most powerful particle accelerator that humans have built: https://home.cern/resources/faqs/facts-and-figures-about-lhc It involves 27 kilometers of superconducting magnets and produces proton collisions of 1.3*10^13eV.
Most cosmic rays are in the range of 10^6 eV to 10^9 eV https://news.uchicago.edu/explainer/what-are-cosmic-rays But there have been a few very powerful cosmic rays detected. Between 2004 and 2007, the Pierre Auger Observatory detected 27 events with energies above 5.7 * 10^19 eV and the "Oh-My-God" particle detected in 1991 had an energy of 3.2 * 10^20 eV.
So they can happen but would be extremely difficult for an adversary to generate. The only reason he put 10^11 as a limit is that's the highest we've been able to definitively explore with accelerators. There may be more unexpected particles up there, but I don't think they would make much of a difference to the kinds of devices we're talking about.
But we certainly have to be vigilant! ASIs will likely explore every avenue and may very well be able to discover the "Theory of Everything". We need to design our systems so that we can update them with new knowledge. Ideally we would also have confidence that our infrastructure could detect attempts to subvert it by pushing outside the domain of validity of our models.
On one hand, while we should recognize that modeling techniques like discrete element analysis can produce quantitative estimates of real-world behavior – for example, how likely a drone is to crash, or how likely a bridge is to fail – we should not lose sight of the fact that such estimates are invariably just estimates and not guarantees. Additionally, from a practical standpoint, estimates of this sort for real-world systems most-often tend to be based on empirical studies around past results rather than prospective modeling. And to the extent that estimates are ever given prospectively about real-world systems, they are invariably considered to be estimates, not guarantees.
I think this is a critical point. Every engineering discipline has precise models of their domain. For example, I'm looking at Peter Childs' "Mechanical Design Engineering Handbook" which surveys the field and is full of detailed partial differential equations for the behavior of different components like bearings, gears, clutches, seals, springs, fasteners, etc. It also has many tables showing the values (with significant figures) of parameters like viscosity of lubricants and the fatigue load of ball bearings.
How did they get these models and parameters? It's a combination of inference from fundamental physics and experimental tests. Given more detailed physics understanding, fewer experiments are needed. And even with no physics, probabilistic models with PAC-like guarantees can be built purely from experiments if the test distribution is the same as the training distribution.
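For example (a standard Hoeffding-style bound; it only holds if the deployment distribution matches the test distribution): after n i.i.d. experiments with observed failure frequency \hat{p}, with probability at least 1 - \delta the true failure probability p satisfies:

```latex
p \;\le\; \hat{p} + \sqrt{\frac{\ln(1/\delta)}{2n}} .
```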
To build safe systems, engineers do computations on these models and bound the probability of system failure (hopefully to a low value!). These computations should be purely mathematical, with precise answers verified by mathematical proofs. Unfortunately, that has not been the culture in "scientific computing" or "numerical methods" in the United States, where the standard advice was instead to have "numerical analysts" look over your code and ensure that it was "numerically stable"!
In my book, that culture has been a disaster, leading to many failures and deaths. The Europeans developed a culture of "verified scientific computing" (for example, I was enamored of this book: "Numerical Toolbox for Verified Computing I: Basic Numerical Problems Theory, Algorithms, and Pascal-XSC Programs (Springer Series in Computational Mathematics)", which shows how to produce provably correct enclosures around solutions to equations, finding global optima, etc.). These days these models can be formalized in a system like Lean and properties deduced using mathematical proof.
How much accuracy do we need and what should we do about incorrect models? For safety, the answer depends on the strengths of the adversary. If your adversary is a typical human, then often weak models suffice. Powerful AI adversaries, however, will look for weaknesses in the design model and try to exploit them. To have any hope of providing actual security, these processes must be modeled and accounted for.
Most engineering disciplines actually have a kind of "stack" of models, each layer justified by the layer below. For example, in chip design there is a digital circuit layer on top of a physical layout layer on top of detailed electrodynamics and solid state materials, etc. Today's chips do a pretty good job of isolating the layers, so a digital circuit laid out according to the chip process node's design rules will actually implement the intended digital behavior.
But attacks like "Rowhammer" show what can happen when an adversary can affect a layer below the digital layer. By repeatedly accessing certain memory cells, an attacker can flip nearby bits and use this to extract cryptographic keys and violate security "guarantees". As DRAMs have gotten denser, the problem has gotten worse. The human-designed mitigation solutions have led to a cat and mouse game that the attackers are winning. I believe that's a great example of a critically important design problem which can only be solved by using AI and formal methods. The ideal solution would be to design chips which provably implement the intended digital model. But we have a lot of bad DRAM chips out there! I believe we need a formal electrodynamic model of the phenomenon and a provable representation of all access patterns which lead to violations. Then we need verified AI program synthesis to resynthesize programs to provably not generate any of the bad access patterns. I don't think any other technique has a chance of dealing with that kind of issue. And you can be sure that advanced AI's will be hammering away at weak computational infrastructure.
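To make "a provable representation of all access patterns which lead to violations" a bit more concrete, here is a toy sketch (the refresh window and activation budget are made-up numbers; real DRAM disturbance thresholds are device-specific): a predicate on row-activation traces that a verified synthesizer would have to guarantee for every possible trace of the resynthesized program.

```python
# Toy sketch of a "safe access pattern" predicate for DRAM row activations.
# A trace is a list of (time_ns, row) activation events.  The rule: within any
# window of REFRESH_NS nanoseconds, no single row may be activated more than
# MAX_ACTIVATIONS times.  (Both constants are illustrative, not real device data.)
from collections import deque, defaultdict

REFRESH_NS = 64_000_000   # 64 ms refresh interval (typical order of magnitude)
MAX_ACTIVATIONS = 50_000  # made-up per-row activation budget per window

def trace_is_safe(trace):
    recent = defaultdict(deque)  # row -> timestamps of its activations in the window
    for t, row in sorted(trace):
        q = recent[row]
        while q and t - q[0] >= REFRESH_NS:   # drop activations outside the window
            q.popleft()
        q.append(t)
        if len(q) > MAX_ACTIVATIONS:
            return False   # this access pattern could disturb neighboring rows
    return True

# A verified synthesizer would only emit programs whose possible traces all
# satisfy trace_is_safe; here we just check one concrete trace.
print(trace_is_safe([(i * 2_000, 7) for i in range(1_000)]))  # row 7 hammered 1000 times -> still safe
```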
Thanks! His work looks very interesting! He recently did this nice talk which is very relevant: "Formal Verification in Scientific Computing"
Figure 1 in Carroll's paper shows what is going on. At the base is the fundamental "Underlying reality" which we don't yet understand (eg. it might be string theory or cellular automata, etc.):
Above that is the "Quantum Field Theory" level which includes the "Core Theory" which he explicitly shows in the paper and also possibly "Unknown particles and forces". Above that is the "Macro Level" which includes both "Everyday life" which he is focusing on and also "Astrophysics and Cosmology". His claim is that everything we experience in the "Everyday life" level depends on the "Underlying reality" level only through the "Core Theory" (ie. it is an "effective theory" kind of like fluid mechanics doesn't depend on the details of particle interactions).
In particular, for energies less than 10^11 electron volts and for gravitational fields weaker than those around black holes, neutron stars, and the early universe, the result of every experiment is predicted by the Core Theory to very high accuracy. If anything in this regime were not predicted to high accuracy, it would be front page news, the biggest development in physics in 50 years, etc. Part of this confidence arises from fundamental aspects of physics: locality of interaction, conservation of mass/energy, and symmetry under the Poincaré group. These have been validated in every experiment ever conducted. Of course, as you say, physics isn't finished and quantum theory in high gravitational curvature is still not understood.
Here's a list of other unsolved problems in physics: https://en.wikipedia.org/wiki/List_of_unsolved_problems_in_physics But the key point is that none of these impact AI safety (at least in the near term!). Certainly, powerful adversarial AI will look for flaws in our model of the universe as a potential opportunity for exploitation. Fortunately, we have a very strong current theory and we can use it to put bounds on the time and energy an AI would require to violate the conditions of validity (eg. create black holes, etc.). For long term safety and stability, humanity will certainly have to put restrictions on those capabilities, at least until the underlying physics is fully understood.
Andrew, thanks for the post. Here's a first response, I'll eventually post more. I agree with most of your specific comments but come to a different set of conclusions (some of which Ben Goldhaber and I spelled out in this LessWrong post: "Provably Safe AI: Worldview and Projects" and some which I discuss in this video "VAISU Provably Safe AI") I agree that formal methods are not yet widely used and that provides some evidence that there might be big challenges to doing so. But these are extraordinary times, both because we are likely to face much more powerful adversaries over the next decade and because our AI systems are likely to become much more mathematically sophisticated over the same time period. It is certainly valuable to understand why formal methods haven't yet been adopted widely but it appears that both the driving necessity and the enabling technology may be rapidly changing soon.
I also agree that the transition period from our current technology to provably safe replacements is likely to be very challenging and may involve great hardship and loss of life. I believe we have a great opportunity right now to solve the problems you mention and others before they are critical. One overriding perspective in Max and my approach is that we need to design our systems so that their safety *can be formally verified*. In software, people often bring up the halting problem as an argument that general software can't be verified. But we don't need to verify general software, we are *designing* our systems so that they can be verified. That issue is also critically important for physical systems. And we are likely to have a challenging period where older systems that haven't been designed for verified safety are vulnerable to attack.
You say:
In particular, physics, chemistry and biology are all complex sciences which do not have anything like complete symbolic rule sets.
We fortunately live in an era where we *do* have a complete formal understanding of the fundamental laws of physics. Sean Carroll summarizes this nicely in his paper "The Quantum Field Theory on Which the Everyday World Supervenes", in which he argues that the Standard Model of particle physics plus Einstein's general relativity completely describes all physical phenomena in ordinary human experience (ie. away from black holes, neutron stars, and the early universe, and at energies less than 10^11 eV; for comparison, chemical bonds are less than 10 eV and reactions in the sun are around 10^8 eV). Much of applied physics is about precise formal models of higher level phenomena (eg. fluids, elastic materials, plasmas, etc.), as beautifully summarized in Kip Thorne's book "Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics". Each of the engineering disciplines (eg. Mechanical Engineering, Electrical Engineering, Chemical Engineering, etc.) has its own formal models of its domain along with design rules for safe systems. They have extensive experiments and formal arguments which ground their models in fundamental physics. Fortunately, many of these disciplines are currently working to represent their fundamental models in formal proof assistants like Lean. For example, here is a chemical physics group doing exactly that: "Formalizing chemical physics using the Lean theorem prover"
The Lean theorem prover is one of several powerful proof assistants (others include Coq, Isabelle, MetaMath, and HOL Light) which are rich enough to encode all of mathematics, including the foundations of physics, engineering, computer science, economics, etc. The Lean library "Mathlib 4" contains most of an undergraduate mathematics curriculum plus some frontier mathematical areas, and is beginning to include probability theory, physics, and other disciplines. Many AI theorem provers are training on this library and are rapidly improving at generating Lean proofs. For example, "DeepSeek-Prover-V1.5" is SOTA among open source provers ("DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search"), and commercial AI companies like Harmonic appear to be making rapid progress ("One month in - A new SOTA on MiniF2F and more"). Several experts working in this area say that we should expect LLM-based AI systems (including "agentic" add-ons like Monte Carlo Tree Search, etc.) to reach human PhD level at theorem proving by 2026. I believe that will open the floodgates for safety and many other applications.
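To give a concrete sense of the artifact being checked, here's a tiny Lean 4 sketch (the theorem names are just illustrative; `Nat.add_comm` is in core Lean and `mul_self_nonneg` is a Mathlib lemma). The point is that whichever system, human or AI, produces the proof term, the small Lean kernel checks it mechanically:

```lean
import Mathlib

-- Toy statements of the kind a prover must certify; once a proof term is
-- produced, the Lean kernel checks it with no need to trust the producer.
theorem add_comm_example (a b : ℕ) : a + b = b + a :=
  Nat.add_comm a b

-- The flavor of bound that shows up in safety specs: a square is nonnegative.
theorem sq_nonneg_example (x : ℤ) : 0 ≤ x * x :=
  mul_self_nonneg x
```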
For example, this came out yesterday showing the sorry state of today's voting machines: "The nation’s best hackers found vulnerabilities in voting machines — but no time to fix them" https://www.politico.com/news/2024/08/12/hackers-vulnerabilities-voting-machines-elections-00173668 And they've been working on making voting machines secure for decades!
Yes, I think there are many, many more possibilities as these systems get more advanced. A number of us are working to flesh out possible stable endpoints. One class of world states is composed of humans, AIs, and provable infrastructure (including manufacturing and datacenters). In that world, humans have total freedom and abundance as long as they don't harm other humans or the infrastructure. In those worlds, it appears possible to put absolute limits on the total world compute, on AI agency, on Moloch-like economic competition, on harmful evolution, on uncontrolled self-improvement, on uncontrolled nanotech, and on many of today's other risks.
But everything depends on how we get from the present state to that future state. I think the Aschenbrenner timelines are plausible for purely technological development. But the social reality may change them drastically. For example, if the social upheaval leads to extensive warfare, I would imagine that GPU datacenters and chip fabs would be likely targets. That could limit the available compute sufficiently to slow down AI development dramatically.
I've spoken about some of the future possibilities in different talks like this one:
but I think it's easier for most people to think about the more immediate issues before considering the longer term consequences.
Given this lower level of ambition, I fail to see how this approach or agenda can be positioned as an alternative to pausing AI.
Oh, the low level of ambition is just for this initial stage, without which the more advanced solutions can't be created. For example, without a provably unbreakable hardware layer supporting provably correct software, you can't have social contracts which everyone can absolutely trust to be carried out as specified.
And I think pausing would be great if it were possible! I signed the Pause letter and I think it generated good and important discussion, but I don't think it slowed down AI development. And the instability of pausing means that at best we get a short extension of the AI development timeline. For that to be of long term value, humanity needs to use the extra time to do something! What should it do? It should flesh out all the pieces needed to effectively implement provable safety!
Intervals are often a great simple form of "enclosure" in continuous domains. For simple functions there is also "interval arithmetic", which cheaply produces a bounding interval on the output of a function given intervals on its inputs: https://en.wikipedia.org/wiki/Interval_arithmetic But, as you say, for complex functions it can blow up. For a simple example of why, consider the function "f(x)=x-x" evaluated on the input interval [0,1]. In the simplest interval arithmetic, the interval for subtraction has to bound the worst possible members of the intervals of its two inputs. In this case that would be a lower bound of "0-1" and an upper bound of "1-0", producing the resulting interval [-1,1]. But, of course, "x-x" is always 0, so this is a huge over-approximation. People have developed all kinds of techniques for capturing the correlations between variables when evaluating circuits on intervals. And you can always shrink the error by splitting the input intervals and doing "branch and bound". But all of those are just particular implementation choices in proving bounds on the output of the function. Advanced AI theorem provers (like AlphaProof) can use very sophisticated techniques to accurately get the true bound on the output of a function.
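Here's a minimal Python sketch of that blow-up and of the branch-and-bound fix (the function names are just for illustration):

```python
# Naive interval arithmetic over-approximates f(x) = x - x on [0, 1];
# branch and bound tightens the (still sound) enclosure.

def sub_intervals(a, b):
    """Naive interval subtraction: ignores correlations between a and b."""
    (alo, ahi), (blo, bhi) = a, b
    return (alo - bhi, ahi - blo)

def f_interval(x):
    """Evaluate f(x) = x - x on an interval, treating the two x's as independent."""
    return sub_intervals(x, x)

print(f_interval((0.0, 1.0)))   # (-1.0, 1.0): huge over-approximation of {0}

def branch_and_bound(interval, depth):
    """Split the input interval and take the union of the pieces' output bounds."""
    if depth == 0:
        return f_interval(interval)
    lo, hi = interval
    mid = (lo + hi) / 2
    left = branch_and_bound((lo, mid), depth - 1)
    right = branch_and_bound((mid, hi), depth - 1)
    return (min(left[0], right[0]), max(left[1], right[1]))

print(branch_and_bound((0.0, 1.0), 5))   # (-0.03125, 0.03125): tighter, still sound
```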
But it may be that trying to bound the behavior of complex neural nets such as transformers is not a fruitful approach. In our approach, we don't need to understand or constrain a complex AI generating a solution or a control policy. Rather, we require the AI to generate a program, control policy, or simple network for taking actions in the situation of interest. And we require it to also produce a proof that this artifact satisfies the given safety requirements. If it can't do that, then it has no business taking actions in a dangerous setting.
I agree that the human responses to all of this are the great unknown. The good news is that human programmers don't need to learn to do formal verification themselves; they just need to realize that it's important and get the AIs to do it. My suspicion is that these big shifts won't happen until we start having AI-caused disasters. It's interesting to look at the human response to the CrowdStrike disaster and other recent security incidents to get some sense of how the responses might go. It's also interesting to see the unbelievable speed with which Microsoft, Meta, Google, etc. have been building huge new GPU datacenters in response to the projected advances in AI.
I am a big fan of verification in principle, but don't think it's feasible in such broad contexts in practice. Most of these proposals are far too vague for me to have an opinion on, but I'd be happy to make bets that
- (3) won't happen (with nontrivial bounds) due to the difficulty of modeling nondeterministic GPUs, varying floating-point formats, etc.
- (8) won't be attempted, or will fail at some combination of design, manufacture, or just-being-pickable.
- with some mutually agreed operationalization against the success of other listed ideas with a physical component
I'm afraid I'm not into betting and I'm not saying these projects will happen, but rather that they should happen. And I'm not saying they are easy but rather that they are important and I don't see any huge barriers to doing them.
In general, non-deterministic hardware isn't a problem for formalization. We're not trying to simulate these systems; we're trying to put bounds on their possible behaviors. In general, proofs work with enclosures that are guaranteed to contain the actual behavior. As far as I know, the various floating point formats are all precisely defined, so there's no problem there. I know that some processor specs stated that their implementations of functions like sin and cos were only guaranteed up to the last bit or so. That's an annoyance but not a real problem for formalization. A simple approach is to maintain intervals which are guaranteed to contain the actual values and to prove that the output intervals don't overlap the unsafe region.
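As a toy illustration of that approach (the "1 ulp" slack and the unsafe region here are made up for the example):

```python
# The exact output value is unknown (non-deterministic hardware, last-bit
# error in sin), but if its guaranteed enclosure misses the unsafe region,
# the run is safe.

import math

def sin_enclosure(x, slack=1e-15):
    """Interval guaranteed to contain sin(x) even if the library is off by ~1 ulp."""
    s = math.sin(x)
    return (s - slack, s + slack)

def disjoint(interval, unsafe):
    """True if the enclosure cannot intersect the unsafe region [unsafe[0], unsafe[1]]."""
    lo, hi = interval
    return hi < unsafe[0] or lo > unsafe[1]

out = sin_enclosure(0.3)
assert disjoint(out, (0.9, 1.1))   # sin(0.3) ≈ 0.2955, safely outside [0.9, 1.1]
```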
Probabilistic programs do provide greater challenges. It's not a problem to formally model and prove properties of probability distributions. But it's often no longer possible to provide 100% guarantees of desired safety properties. The approach of precise probabilistic theories like "Probably approximately correct learning" (https://en.wikipedia.org/wiki/Probably_approximately_correct_learning) is to provide a provable bound on the probability of failure.
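For a concrete sense of the kind of guarantee PAC learning gives, here's the classic sample-complexity bound for a finite hypothesis class with a consistent learner, computed in a few lines of Python (the numbers in the example are just illustrative):

```python
# For a finite hypothesis class H and a learner that returns a hypothesis
# consistent with the samples, m >= (1/eps) * (ln|H| + ln(1/delta)) samples
# suffice for true error <= eps with probability >= 1 - delta.

import math

def pac_sample_bound(hypothesis_count, eps, delta):
    """Classic finite-class, consistent-learner sample complexity bound."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / eps)

# e.g. a million candidate rules, 1% error, 99.9% confidence:
print(pac_sample_bound(10**6, eps=0.01, delta=0.001))   # 2073 samples
```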
There is subtlety in adversarial situations: there is a difference between uncertainty that the adversary can't affect, for which an ordinary probabilistic analysis suffices, and apparently random variables that the adversary can influence. For example, I believe we need to create formal models of all of the engineering disciplines (eg. mechanical engineering, electrical engineering, chemical engineering, etc.). The rules that those engineers use are already quite formal, and there are heuristic arguments as to why they are grounded in fundamental physics. But in an adversarial context, an adversary may be able to influence a deeper level of physics in a way that breaks the assumptions of the model. For example, mechanical engineers have models of the elastic properties of steel beams, but if an adversary can alter the manufacture of those beams in a way that makes them vulnerable to certain kinds of failure, then the mechanical engineer's assumptions can be invalidated. So the formal models also need to model potential vulnerabilities in the assumptions of the higher level models.
I believe that these developments are not optional. It's just a question of whether humanity prepares now for what's coming or waits until after the basic assumptions of our infrastructure and society are upended.
Do you envision being able to formalize social systems and desirable properties for them, based on current philosophical understanding of topics like human values/goals and agency / decision theory? I don't, and also think philosophical progress on these topics is not happening fast enough to plausibly solve the problems in time.
I think there are both short-term and long-term issues. In the short term, society is going to have to govern a much wider range of behaviors than it does today. In the next few years we are likely to have powerful AIs taking actions online and in the physical world, both autonomously and at the behest of humans. I think the most immediate issue is that today's infrastructure is likely to be very vulnerable to AI-powered attacks. The basic mechanisms of society are likely to be very vulnerable: voting, communication, banking, etc. Already we are seeing AI-driven hacking, impersonation, fake news, spam, etc. Next year's more powerful AI agents will likely take these to the next level.
Social media sites are already getting overwhelmed by spam, fake images, fake videos, blackmail attempts, phishing, etc. The only way to counteract the speed and volume of massive AI-driven attacks is with AI-powered defenses. These defenses need rules. If those rules aren't formal and proven robust, then they will likely be hacked and exploited by adversarial AIs. So at the most basic level, we need infrastructure rules which are provably robust against classes of attacks. What those attack classes are and what properties those rules guarantee is part of what I'm arguing we need to be working on right now. It looks like the upheaval is coming whether we want it or not. And the better prepared we are for it, the better the outcome.
Over the longer term (which may not be that long if major societal transformation comes around 2030), I totally agree that there are many important philosophical issues that we will need to address. And I agree that we shouldn't try to solve all those problems right now. But we should create a path along which we can make those choices and which is implementable with the technology that appears imminent. Today, democratic societies use processes like voting to choose governance rules. The blockchain world has explored various precise variants of those procedures. I think a good path forward might involve precisely formalizing effective mechanisms like prediction markets, quadratic voting, etc. so that we have confidence that future social infrastructure actually implements them.
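As a tiny example of how concrete such mechanisms already are, the core rule of quadratic voting fits in a couple of lines (the ballot and budget here are hypothetical); the hard and valuable part is proving that the surrounding infrastructure actually enforces it:

```python
# Quadratic voting: buying n votes on an issue costs n**2 credits, so
# expressing stronger preferences has rapidly increasing marginal cost.
# A formalized version would come with proofs about budget accounting
# and tamper-resistance; this is just the rule itself.

def quadratic_vote_cost(votes):
    return votes ** 2

budget = 100
ballot = {"issue_a": 6, "issue_b": 8}          # hypothetical ballot
spent = sum(quadratic_vote_cost(v) for v in ballot.values())
assert spent <= budget                         # 36 + 64 = 100 credits
```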
It appears that rather simple provable hardware (what Max and I called "provable contracts") can enable the solution of many social dilemmas to produce better outcomes for all. Provable hardware can impose joint constraints on the choices of multiple parties and can include externalities in the execution of contracts. If implemented correctly, these could dramatically improve the efficiency, fairness, and effectiveness of human societies.
Timelines are certainly short. But if AI systems progress as rapidly as Aschenbrenner and others are predicting, then most of the work will be done by AIs. Of course, we have to know what we want them to accomplish! And that's something we can work on before the AIs are dangerous or powerful enough to build the safe infrastructure.
Max and I argued that systems which are not built with security proofs will likely have vulnerabilities, that powerful AIs can search for them using formal models, and that they are likely to find them. So if things unfold in an adversarial way, any software, hardware, or social systems not built with provable safety will likely be exploited.
Absolutely! In different circumstances different classes of adversarial capabilities are appropriate. Any formal security claim will always be relative to a specified class of attackers. For the most secure infrastructure which must be secure against any physical agent (including the most powerful AGIs), you would show security properties against any agent which obeys the laws of physics. Max and I talk about "provable contracts" which sense any form of physical tampering and erase internal cryptographic keys to guarantee that contained information cannot be extracted.
Today, many residential front doors are vulnerable to "bump key" attacks which involve inserting a key with all the key bittings filed down and tapping it with a mallet to rapidly open the door with no skill. There would be tremendous benefit to mechanical locks which are not vulnerable to this or more sophisticated purely mechanical attacks.
It would also be great to create designs which are provably invulnerable to the kind of sophisticated key you suggest. A general approach is to physically isolate the part of the lock which senses the pattern on the key from the part which checks that that pattern is correct. In today's locks, those two functions happen together, and lockpickers exploit that. In the kinds of designs I'd prefer to see, no information about the correct key is available from the parts of the lock that an attacker can reach.
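A digital analogy of that design principle (this is not a mechanical design, just an illustration of "sense here, check there"): store only a salted hash of the correct pattern, so the attacker-facing side contains no recoverable information about the key:

```python
# The 'sensing' stage only reads the pattern presented; the isolated
# 'checking' stage holds a salted hash, so nothing about the correct
# bittings is recoverable from the parts an attacker can probe.

import hashlib, hmac, os

SALT = os.urandom(16)
CORRECT_DIGEST = hashlib.sha256(SALT + bytes([3, 5, 2, 7, 4, 6])).digest()  # hypothetical bittings

def sense(key_bittings):
    """Reads the pattern presented; learns nothing secret."""
    return bytes(key_bittings)

def check(pattern):
    """Constant-time comparison against the stored digest."""
    return hmac.compare_digest(hashlib.sha256(SALT + pattern).digest(), CORRECT_DIGEST)

print(check(sense([3, 5, 2, 7, 4, 6])))   # True
print(check(sense([1, 1, 1, 1, 1, 1])))   # False
```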
"‘Sinkclose’ Flaw in Hundreds of Millions of AMD Chips Allows Deep, Virtually Unfixable Infections" https://archive.ph/MsA0m
Thanks for the links! Yes, the cost of errors is high for chips, so chip companies have been very supportive of formal methods. But they've primarily focused on the correctness of digital circuits. There are many other physical attack surfaces.
For example, most DRAM chips are vulnerable to the "Rowhammer" attack https://en.wikipedia.org/wiki/Row_hammer and as memories get denser, they become more vulnerable. The attack uses memory access patterns which flip bits in neighboring rows, and it can be exploited to discover cryptographic keys and break security assumptions in processors. The solutions so far are fairly hacky, and attackers have found workarounds. A full solution will likely involve a formal model including the electromagnetic and material physics of the memory chip. Such a model might be used to characterize all problematic access patterns and to translate software into provably equivalent versions which don't produce the dangerous accesses.
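As a toy sketch of what "characterizing problematic access patterns" might look like (the threshold and trace below are made up; a real model would need the chip's actual electrical and timing parameters):

```python
# Within one refresh window, count activations of each row and flag any row
# whose neighbors are hammered more than a threshold number of times.

from collections import Counter

HAMMER_THRESHOLD = 50_000   # hypothetical activations per refresh window

def risky_rows(row_accesses, threshold=HAMMER_THRESHOLD):
    """Return victim rows whose adjacent rows are activated too often."""
    counts = Counter(row_accesses)
    victims = set()
    for row, n in counts.items():
        if n > threshold:
            victims.update({row - 1, row + 1})   # neighbors of the aggressor row
    return victims

trace = [1000] * 60_000 + [1002] * 60_000 + [500] * 10   # hypothetical access trace
print(risky_rows(trace))   # {999, 1001, 1003}: row 1001 is double-sided hammered
```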
Another example from yesterday is the AMD "Sinkclose" flaw which affects all AMD chips back to 2006 and makes them vulnerable to attacks after which "You basically have to throw your computer away." In a world of powerful AI-enabled attackers, that kind of flaw is unlikely to be acceptable.
With the recent developments in drones, humanoid robots, and autonomous vehicles, we are likely to need provable constraints on the behavior of a wide variety of physical systems. In the paper with Max we argue that this is possible and that the laws of physics provide an absolute constraint on the behavior of AIs. But it's not easy! And the time is likely short.
Fortunately, for coarse "guardrails" the specs are pretty simple and can often be reused in many contexts. For example, all software we want to run should have proofs that: 1) there aren't memory leaks, 2) there aren't out-of-bounds memory accesses, 3) there aren't race conditions, 4) there aren't type violations, 5) there aren't buffer overflows, 6) private information is not exposed by the program, 7) there aren't infinite loops, etc. There should be a widely used "metaspec" for those criteria which most program-synthesis AIs will have to prove their generated code satisfies. Similarly, there are common constraints for many physical systems: eg. robots, cars, planes, boats, etc. shouldn't crash into things or harm humans. The more refined the rules are, the more subtle they become. To prevent existentially bad outcomes, I believe coarse constraints suffice. But certainly we eventually want much more refined models of the world and of the outcomes we seek.

I'm a fan of "Digital Twins" of physical systems, which allow rules and constraints to be run in simulation and can help in choosing specifications. We certainly want those simulations to be trusted, which can be achieved by proving that the code actually simulates the systems it claims to. Eventually it would be great to have fully trusted AI as well! Mechanistic Interpretability should be great for that, and I'm just reading about Anthropic's recent advances there. If that continues to make progress, then it makes our lives much easier, but it doesn't eliminate the need to ensure that misaligned and malicious AGI don't cause harm. The big win with the proof checking and the cryptographic hardware we propose is that we can ensure that even powerful systems will obey rules that humanity selects. If we don't implement that kind of system (or something functionally equivalent), then there will be dangerous pathways which malicious AGI can exploit to cause great harm to humans.
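Here's a minimal sketch of the "metaspec" idea as a gate on AI-generated code (the property names and certificate format are purely illustrative; in practice each entry would be a machine-checked proof, not a boolean):

```python
# A fixed checklist of coarse safety properties, and a gate that only accepts
# AI-generated code when the accompanying certificate discharges every one.

COARSE_METASPEC = {
    "no_memory_leaks",
    "no_out_of_bounds_access",
    "no_race_conditions",
    "no_type_violations",
    "no_buffer_overflows",
    "no_private_data_exfiltration",
    "termination",
}

def gate(certificate: dict) -> bool:
    """Accept code only if a verified proof is supplied for every required property."""
    proved = {prop for prop, proof_ok in certificate.items() if proof_ok}
    missing = COARSE_METASPEC - proved
    if missing:
        print("rejected, missing proofs for:", sorted(missing))
        return False
    return True

gate({p: True for p in COARSE_METASPEC})               # accepted
gate({"termination": True, "no_memory_leaks": True})   # rejected
```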
Peter, thanks again for starting this discussion! Just a few caveats on your summary. We don't depend on trustable AIs! One of the absolutely amazing and critical characteristics of mathematical proof is that anybody, trusted or not, can create proofs and we can check them without any need to trust the creator. For example, MetaMath https://us.metamath.org/ defines an especially simple system for which somebody has written a 350 line Python proof checker which can check all 40,000 MetaMath theorems in a few seconds. We need to make sure that that small Python program is correct, but beyond that we don't need to trust any AIs! It certainly would be nice to have trustable AIs, but I think it is a mistake to depend on them. We are already seeing a wide range of open source AIs created by different groups with different intentions. And there are many groups working to remove the safety features of commercial Large Language Models and text-to-image models.
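To illustrate why no trust in the proof's creator is needed, here's a toy checker (far simpler than MetaMath's) for a system whose only inference rule is modus ponens; any step that isn't a listed axiom or a valid modus ponens application is rejected, no matter how capable the AI that produced it:

```python
# Each proof line must be a listed axiom or follow from two earlier lines by
# modus ponens; the tiny checker trusts nobody.

def check_proof(axioms, proof):
    """proof is a list of steps: ('axiom', formula) or ('mp', i, j, formula),
    where line j must be ('implies', <formula of line i>, formula)."""
    derived = []
    for step in proof:
        if step[0] == "axiom":
            _, formula = step
            if formula not in axioms:
                return False
        elif step[0] == "mp":
            _, i, j, formula = step
            if not (i < len(derived) and j < len(derived)):
                return False
            if derived[j] != ("implies", derived[i], formula):
                return False
        else:
            return False
        derived.append(formula)
    return True

AXIOMS = {"p", ("implies", "p", "q")}
proof = [("axiom", "p"),
         ("axiom", ("implies", "p", "q")),
         ("mp", 0, 1, "q")]
print(check_proof(AXIOMS, proof))   # True
```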
The core of our proposal is to put provable guardrails around dangerous technologies that AGI might seek to control. As many commenters have pointed out, we need to figure out what those are! Some current examples include DNA synthesis machines, nukes, military hardware of all kinds, drones capable of dispersing dangerous payloads, etc. And we need to create rules for which actions with these systems are safe! But we already do that today for human use: the CDC, for example, has detailed rules for biohazard labs. The initial implementation of our proposal will most likely be digitally checkable versions of those existing human rules. But certainly we need to extend them to the new environment as AIs become more powerful. Any action remotely capable of causing human extinction should become a primary focus. I believe establishing the rules for high risk actions should be an immediate priority for humanity. And extending those rules to other systems to enable human flourishing should be a close second.
We can use untrusted AIs in much of this work! We simply require them to generate formal proofs of any designs or software that they create. These can then be easily and rigorously checked and they can be used without concern for the properties of the AI which generated them.
Check out this great paper: "From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought" https://arxiv.org/abs/2306.12672 It proposes "probabilistic programming" as a formal "Probabilistic Language of Thought" (PLoT) with precise formal Bayesian reasoning. They show in 4 domains how a large language model can convert an informal statement or chain of reasoning into a precise probabilistic program, do precise Bayesian reasoning on that, and then convert the results back into informal natural language.
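Here's a minimal, plain-Python version of that pipeline (not the paper's PLoT system): an informal claim like "the coin came up heads 8 times out of 10; is it biased toward heads?" becomes a small probabilistic program, and Bayesian reasoning over a grid of hypotheses gives a precise answer that can be rendered back into English:

```python
# Grid-based Bayesian inference for a coin's heads probability given 8/10 heads.

import math

def binomial_likelihood(p, heads, flips):
    return math.comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

grid = [i / 100 for i in range(101)]                  # hypotheses for P(heads)
prior = [1 / len(grid)] * len(grid)                   # uniform prior
likelihood = [binomial_likelihood(p, heads=8, flips=10) for p in grid]
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
posterior = [u / sum(unnorm) for u in unnorm]

p_biased = sum(post for p, post in zip(grid, posterior) if p > 0.5)
print(f"P(coin biased toward heads | data) ≈ {p_biased:.2f}")   # roughly 0.97
```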
Thanks Peter for the post and thank you everyone for the comments. Let me try to clarify a bit. We're looking for an absolute foundation of trust on top of which we can build a world safe for AGI. We believe that we need to adopt a "Security Mindset" in which AGIs, whether acting on their own or controlled by malicious humans, are treated as full-on adversaries. The only two absolute guarantees that we have are mathematical proof and the laws of physics. Even the most powerful AGI can't prove a falsehood or violate the laws of physics. Based on these, we show how to build a network of "provable contracts" that provide absolute guardrails around dangerous actions. As a commenter points out, figuring out which actions we need to protect against and what the rules should be is absolutely essential and not at all trivial! In fact, I believe that should be one of the primary activities of humanity for the next decade!
Very interesting thought experiment!
One place where it might fall down is that our disutility for causing deaths is probably not linear in the number of deaths, just as our utility for money flattens out as the amount gets large. In fact, I could imagine that its value is connected to our ability to intuitively grasp the numbers involved. The disutility might flatten out really quickly so that the disutility of causing the death of 3^^^^3 people, while large, is still small enough that the small probabilities from the induction are not overwhelmed by it.