Should you publish solutions to corrigibility?
post by rvnnt · 2025-01-30T11:52:05.983Z · LW · GW
This is a question post.
Contents
Answers: Seth Herd · Noosphere89 · Charlie Steiner · Satron · rvnnt
This question is partly motivated by observing recent discussions [LW · GW] about corrigibility and wondering to what extent the people involved have thought about how their results might be used. [LW · GW]
If there existed practically implementable ways to make AGIs corrigible to arbitrary principals, that would enable a wide range of actors to eventually control powerful AGIs. Whether that would be net good or bad in expectation would depend on the values/morality of the principals of such AGIs.
Currently it seems highly unclear what kinds of people we should expect to end up in control of corrigible ASIs, if corrigibility were practically feasible.
What (crucial [? · GW]) considerations should one take into account, when deciding whether to publish---or with whom to privately share---various kinds of corrigibility-related results?
Answers
answer by Seth Herd
It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity. Both seem unlikely at this point, to me. It's hard to tell when your alignment plan is good enough, and humans are foolishly optimistic about new projects, so they'll probably build AGI with or without a solid alignment plan.
So I'd say any and all solutions to corrigibility/control should be published.
Also, almost any solution to alignment in general could probably be used for control as well. And probably would be. See
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than [LW · GW]
And
https://www.lesswrong.com/posts/587AsXewhzcFBDesH/intent-alignment-as-a-stepping-stone-to-value-alignment [LW · GW]
↑ comment by rvnnt · 2025-01-30T14:36:15.138Z · LW(p) · GW(p)
It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity.
I'm assuming neither. I agree with you that both seem (very) unlikely. [1]
It seems like you're assuming that any humans succeeding in controlling AGI would be (in expectation) preferable to extinction? If so, that seems like a crux: if I agreed with that, then I'd also agree with "publish all corrigibility results".
I expect that unaligned ASI would lead to extinction, and to our share of the lightcone being devoid of value or disvalue. I'm quite uncertain, though. ↩︎
↑ comment by Seth Herd · 2025-01-30T16:38:56.884Z · LW(p) · GW(p)
I see. I think at the very least about 99% of humanity are not so sadistic as to create a future with less than zero utility. Sociopaths are something like ten percent of the population, but like everything else it's on a spectrum. Sociopaths usually also have some measure of empathy and desire for approval. In a world where they've won, I think most of them would rather be hailed as a hero than be an eternal sadistic despot. Sociopathy doesn't automatically include a lot of sadism, just a desire for revenge against perceived enemies.
So I'd take my chances with a human overlord far before accepting extinction.
Note that our light cone ending up with zero value might also eclipse other light cones that might have had value if we hadn't let our AGI go rogue in order to avoid s-risk.
answer by Noosphere89
The answer depends on your values, and thus there isn't really a single answer to give here.
answer by Charlie Steiner
Yes. Current AI policy is like people in a crowded room fighting over who gets to hold a bomb. It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.
That said, we're currently not near any satisfactory solutions to corrigibility. And I do think it would be better for the world if it were easier (by some combination of technical and societal factors) to build AI that works for the good of all humanity than to build equally-smart AI that follows the orders of a single person. So yes, we should focus research and policy effort toward making that happen, if we can.
And if we were in that world already, then I agree releasing all the technical details of an AI that follows the orders of a single person would be bad.
↑ comment by rvnnt · 2025-01-30T14:30:39.271Z · LW(p) · GW(p)
It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.
I think there is a key disanalogy to the situation with AGI: the analogy would be stronger if the bomb were likely to kill everyone, but also had some (perhaps very small) probability of conferring godlike power on whoever holds it. I.e., there is a tradeoff: decrease the probability of dying, at the expense of increasing the probability of s-risks from corrupt(ible) humans gaining godlike power.
If you agree that there exists that kind of tradeoff, I'm curious as to why you think it's better to trade in the direction of decreasing probability-of-death for increased probability-of-suffering.
So, the question I'm most interested in is the one at the end of the post[1], viz
What (crucial) considerations should one take into account, when deciding whether to publish---or with whom to privately share---various kinds of corrigibility-related results?
Didn't put it in the title, because I figured that'd be too long of a title. ↩︎
answer by Satron
Maybe it's a controversial take, but I am in favor of publishing all solutions to corrigibility.
If a company fails at coming up with solutions for AGI corrigibility, it won't stop building AGI. Instead, it will proceed with ramping up capabilities and end up with a misaligned AGI that (for all we know) will want to turn the universe into paperclips.
Due to instrumental convergence, an AGI whose goals are not explicitly aligned with human goals is going to engage in very undesirable behavior by default.
answer by rvnnt
Taking a stab at answering my own question, here's an almost-certainly non-exhaustive list:
- Would the results be applicable to deep-learning-based AGIs?[1] If I think not, how can I be confident they couldn't be made applicable?
- Do the corrigibility results provide (indirect) insights into other aspects of engineering (rather than SGD'ing) AGIs?
- How much weight one gives to avoiding x-risks vs s-risks.[2]
- Who actually needs to know of the results? Would sharing the results with the whole Internet lead to better outcomes than (e.g.) sharing the results with a smaller number of safety-conscious researchers? (What does the cost-benefit analysis look like? Did I even do one?) A toy sketch of such a comparison appears at the end of this answer.
- How optimistic (or pessimistic) one is about the common-good commitment [LW · GW] (or corruptibility) of the people who one thinks might end up wielding corrigible AGIs.
Something like the True Name of corrigibility might at first glance seem applicable only to AIs whose internals we meaningfully understand or control. ↩︎
If corrigibility were easily feasible, then at first glance, that would seem to reduce the probability of extinction (via unaligned AI), but increase the probability of astronomical suffering (under god-emperor Altman/Ratcliffe/Xi/Putin/...). ↩︎
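To make the x-risk vs s-risk weighting and the cost-benefit question above concrete, here is a minimal expected-value sketch in Python. Every probability and utility in it is a made-up placeholder rather than an estimate, and the publish/withhold framing is purely illustrative; the only point is to show how the weighting of extinction against astronomical suffering enters the comparison.

```python
# Toy expected-value comparison of "publish" vs "withhold" for corrigibility results.
# Every number below is a made-up placeholder, not an estimate.

def expected_value(p_extinction: float, p_s_risk: float,
                   u_extinction: float = 0.0,
                   u_s_risk: float = -10.0,
                   u_ok: float = 1.0) -> float:
    """Expected utility of a policy given its outcome probabilities.

    u_extinction: utility of an empty (valueless) lightcone.
    u_s_risk:     utility of a future with astronomical suffering.
    u_ok:         utility of an acceptable outcome.
    """
    p_ok = 1.0 - p_extinction - p_s_risk
    return (p_extinction * u_extinction
            + p_s_risk * u_s_risk
            + p_ok * u_ok)

# Placeholder scenario: publishing lowers extinction risk but raises s-risk.
ev_publish = expected_value(p_extinction=0.30, p_s_risk=0.10)
ev_withhold = expected_value(p_extinction=0.40, p_s_risk=0.02)

print(f"publish:  {ev_publish:+.2f}")   # -0.40 with these placeholders
print(f"withhold: {ev_withhold:+.2f}")  # +0.38 with these placeholders
# With u_s_risk = -10 the ranking favors withholding; if u_s_risk is set close
# to u_extinction, publishing comes out ahead instead. The whole comparison
# hinges on that weighting -- which is the x-risk vs s-risk question above.
```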
No comments