Posts
Comments
In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of "happiness", one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.
I don't quite understand the distinction your'e drawing here.
In both cases the AI was never trying to pursue happiness. In both cases it was pursuing something else, shmappiness, that correlated strongly with causing happiness in the training but not deployment environments. In both cases strength matters for making this disastrous as it will find more disastrous ways of pursuing schmappiness, It's just that the it is pursuing different varieties of shmappiness in the different cases.
I don't have a view on whether "goal misgeneralisation" as a term is optimal for this kind of thing.
Therefore rational beliefs are contagious, among honest folk who believe each other to be honest. And it’s why a claim that your beliefs are not contagious—that you believe for private reasons which are not transmissible—is so suspicious. If your beliefs are entangled with reality, they should be contagious among honest folk.
I don't get this inference. seems like the belief itself is the evidence -- and you entangle your friend with the object of your belief just by telling them your belief -- regardless if you can explain the reasons? (private beliefs seem to me suspicious on other grounds)
A knowledge explosion itself -- to the extent that that is happening -- seems like it could be a great thing. So for what it's worth my guess would be that it does make sense to focus on mitigating the specific threats that it creates (insofar as it does) so that the we get the benefits too.
I feel like I don't understand how this model explains the biggest mystery of expereinces sometimes having the reverse impact on your beliefs vs. what they should.
The more technical version of this same story is that habituation requires a perception of safety, but (like every other perception) this one depends on a combination of raw evidence and context. The raw evidence (the Rottweiler sat calmly wagging its tail) looks promising. But the context is a very strong prior that dogs are terrifying. If the prior is strong enough, it overwhelms the real experience. Result: the Rottweiler was terrifying. Any update you make on the situation will be in favor of dogs being terrifying, not against it!
Shouldn't your experience still be less terrifying than you expected it to be, becuase you're combining your dogs-are-terrifying-at-level-10 prior with the raw evidence (however constricted that channel is), so your update should still be against dogs being terrifying at level 10 (maybe level 9.9?)?
Maybe the answer is the thing smountjoy said below in response to your caption, that we don't have gradations in our beliefs about things -- dogs are either terrifying or not -- and then you have another example of dogs being terrifying to update with. FWIW that sounds unlikley to me -- people do seem to tend to have gradations in how evil republicans are or how terrifying dogs are in my experience. Though mabe that gets disabled in these cases, which seems like would explain it.