reallyeli's Shortform
post by reallyeli · 2025-03-04T00:47:34.431Z · LW · GW · 2 comments
comment by reallyeli · 2025-04-10T03:37:56.867Z · LW(p) · GW(p)
If AI turns out to be very useful for cheaply writing formally verified code, what does that do for AI control? We can then request that the untrusted AI produce, along with any code it writes, a spec and a certificate verifying that the code matches the spec.
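To make this concrete, here is a minimal sketch in Lean 4 of what such a deliverable might look like. The function and the property being proved are purely illustrative, not a claim about what real specs would cover:

```lean
-- The implementation: the code the untrusted model ships.
def absDiff (a b : Nat) : Nat :=
  if a ≥ b then a - b else b - a

-- The spec, stated as a theorem that human reviewers read and sign off on.
-- Here: the result is symmetric in its two arguments.
theorem absDiff_comm (a b : Nat) : absDiff a b = absDiff b a := by
  unfold absDiff
  split <;> split <;> omega
```

The proof term is the machine-checkable certificate: a monitor only has to trust the proof checker and read the spec, rather than audit the implementation itself.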
How bad a position does this put the untrusted model in when it's trying to write code that does bad stuff? Some sub-questions:
- How tight are the constraints that the formal verification process, at its AI-driven heights, can put on the code? It seems like getting something formally verified at all already rules out large classes of vulnerabilities (e.g., memory-safety bugs) that an AI might exploit. Can we push this further, to remove almost all attack vectors?
- How easy is it to write a spec that looks good to humans but actually contains a significant omission? (A sketch of this failure mode follows this list.)
- How feasible is it for the spec writer to collude with the code writer?
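On the second sub-question: the classic failure mode is a spec that states a true-sounding property while silently dropping a companion property. A minimal Lean 4 illustration (all names hypothetical): a "sorted output" spec for a sorting routine that omits the requirement that the output be a permutation of the input.

```lean
-- The spec humans review: "the output is sorted." Sounds complete...
def Sorted : List Nat → Prop
  | [] => True
  | [_] => True
  | a :: b :: rest => a ≤ b ∧ Sorted (b :: rest)

-- ...but nothing ties the output to the input, so a malicious "sort"
-- that discards all the data still satisfies the spec:
def badSort (_xs : List Nat) : List Nat := []

theorem badSort_meets_spec (xs : List Nat) : Sorted (badSort xs) :=
  True.intro  -- the empty list is vacuously sorted
```

A reviewer who skims the spec sees "verified sorted output" and may never notice the missing permutation clause; spotting omissions like this is exactly the hard part.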
comment by reallyeli · 2025-03-04T00:47:34.429Z · LW(p) · GW(p)
A good ask of frontier AI companies, for avoiding massive concentration of power, might be:
- "don't have critical functions controllable by the CEO alone, or by any one person alone, and periodically check that this is still the case / check for backdoors"
since this seems both important and likely to be popular.
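As a toy illustration of the first clause, here's a minimal sketch (hypothetical names and numbers, not any company's actual setup) of a critical function gated behind a k-of-n quorum of distinct keyholders, so that no one person's approval is ever sufficient:

```python
# Hypothetical k-of-n approval gate for a critical function.
REQUIRED_APPROVALS = 2                   # quorum size (k)
KEYHOLDERS = {"alice", "bob", "carol"}   # authorized signers (n = 3)

def critical_action_allowed(approvals: set[str]) -> bool:
    """Allow the action only if at least k distinct authorized keyholders approved."""
    valid = approvals & KEYHOLDERS       # ignore approvals from non-keyholders
    return len(valid) >= REQUIRED_APPROVALS

assert not critical_action_allowed({"alice"})       # one person is never enough
assert critical_action_allowed({"alice", "carol"})  # quorum reached
```

The "check for backdoors periodically" part would then amount to auditing that no code path bypasses gates like this one.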