LLM keys - A Proposal of a Solution to Prompt Injection Attacks

post by Peter Hroššo (peter-hrosso) · 2023-12-07T17:36:23.311Z · LW · GW · 2 comments

Contents

  The situation
  The problem
  Solution
  Limitations
None
2 comments

Disclaimer: This is just an untested rough sketch of a solution which I believe should work. I'm posting it mostly to crowdsource reasons why this wouldn't work. Motivated by Amjad Masad and Zvi conjecturing it might be fundamentally unsolvable.

The situation

The problem

Solution

Limitations

2 comments

Comments sorted by top scores.

comment by cwillu (carey-underwood) · 2023-12-07T17:43:02.471Z · LW(p) · GW(p)
  • introduce two new special tokens unused during training, which we will call the "keys"
  • during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
  • finetune the LLM to behave in the following way:
    • generate text as usual, unless an input attempts to modify the system prompt
    • if the input tries to modify the system prompt, generate text refusing to accept the input
  • don't give users access to the keys via API/UI

 

Besides calling the special control tokens “keys”, this is identical to how instruction-tuning works already.

Replies from: peter-hrosso
comment by Peter Hroššo (peter-hrosso) · 2023-12-07T18:36:01.761Z · LW(p) · GW(p)

Thanks. So what do you think is the core of the problem? The LLM not recognizing that a user given instruction is trying to modify the system prompt and proceeds out of its bounds?