Posts

Image Hijacks: Adversarial Images can Control Generative Models at Runtime 2023-09-20T15:23:48.898Z
Tensor Trust: An online game to uncover prompt injection vulnerabilities 2023-09-01T19:31:24.863Z
Examples of Prompts that Make GPT-4 Output Falsehoods 2023-07-22T20:21:39.730Z

Comments

Comment by Luke Bailey (luke-bailey) on Image Hijacks: Adversarial Images can Control Generative Models at Runtime · 2023-09-22T01:07:19.271Z · LW · GW

I think this is an interesting point. We are actually conducting some follow-up work to see how robust our attacks are to various additional "defensive" perturbations (e.g. downscaling, adding noise). As Matt notes, when running these experiments it is important to check how such perturbations also affect the model's general vision-language modeling performance.

My prior right now is that this kind of technique may make it possible to defend against the L-infinity-constrained images, but probably not against the moving-patch attacks, which exhibited higher-level features. In general, adversarial attacks are a cat-and-mouse game, so I expect that if we can show defenses like this work, a new training scheme will come along that produces adversarial images robust to those defenses.

It is also worth noting that most VLMs only accept small, low-resolution images already. For example, LLaVA (with LLaMA-13B), which is state of the art among open-source models, only accepts images of roughly 200×200 pixels, so the above example is not necessarily a fair one.
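To make the "defensive perturbation" idea concrete, here is a minimal sketch of the kind of input preprocessing described above (downscaling plus small Gaussian noise), applied to an image before it is passed to a VLM. The function name `defensive_preprocess` and the specific parameter values are illustrative assumptions, not the exact procedure used in our follow-up experiments.

```python
# Minimal sketch of defensive input preprocessing: downscale and add noise.
# Assumes Pillow and NumPy; `defensive_preprocess` and the parameter values
# (target size, noise_std) are hypothetical/illustrative choices.
import numpy as np
from PIL import Image


def defensive_preprocess(image: Image.Image,
                         small_size=(112, 112),
                         noise_std=0.02) -> Image.Image:
    """Downscale (then restore resolution) and add Gaussian pixel noise.

    Both operations disturb the precise pixel values that an
    L-infinity-constrained adversarial image depends on, at some cost
    to clean vision-language performance.
    """
    original_size = image.size

    # Downscale, then upscale back so the model still sees its expected resolution.
    small = image.resize(small_size, Image.BILINEAR)
    restored = small.resize(original_size, Image.BILINEAR)

    # Add small Gaussian noise in [0, 1] pixel space, then clip back to valid range.
    arr = np.asarray(restored).astype(np.float32) / 255.0
    arr = arr + np.random.normal(0.0, noise_std, arr.shape)
    arr = np.clip(arr, 0.0, 1.0)

    return Image.fromarray((arr * 255).astype(np.uint8))
```

The key evaluation question, as noted above, is the trade-off: how much such preprocessing degrades attack success versus how much it degrades the model's performance on benign images.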