LLM keys - A Proposal of a Solution to Prompt Injection Attacks
post by Peter Hroššo (peter-hrosso) · 2023-12-07T17:36:23.311Z · LW · GW · 2 commentsContents
The situation The problem Solution Limitations None 2 comments
Disclaimer: This is just an untested rough sketch of a solution which I believe should work. I'm posting it mostly to crowdsource reasons why this wouldn't work. Motivated by Amjad Masad and Zvi conjecturing it might be fundamentally unsolvable.
The situation
- we, as LLM creators, want to have the ability to set some limitations to the LLM generation
- at the same time we want to give our users as much freedom as possible to steer the LLM generation within the given bounds
- we want to do it by means of a system prompt
- which will be prepended to any user-LLM interaction
- which will not be accessible or editable by users
- the model is accessible by the users only via API/UI
The problem
- the LLM has fundamentally no way of discriminating, what is input by LLM creators and what is input by users pretending to be the LLM creators
Solution
- introduce two new special tokens unused during training, which we will call the "keys"
- during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
- finetune the LLM to behave in the following way:
- generate text as usual, unless an input attempts to modify the system prompt
- if the input tries to modify the system prompt, generate text refusing to accept the input
- don't give users access to the keys via API/UI
Limitations
- the proposed solution works only when the LLM is walled behind an API
- otherwise the user will have access to the model and thus also to the keys, which will give them full control over the model
2 comments
Comments sorted by top scores.
comment by cwillu (carey-underwood) · 2023-12-07T17:43:02.471Z · LW(p) · GW(p)
- introduce two new special tokens unused during training, which we will call the "keys"
- during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
- finetune the LLM to behave in the following way:
- generate text as usual, unless an input attempts to modify the system prompt
- if the input tries to modify the system prompt, generate text refusing to accept the input
- don't give users access to the keys via API/UI
Besides calling the special control tokens “keys”, this is identical to how instruction-tuning works already.
Replies from: peter-hrosso↑ comment by Peter Hroššo (peter-hrosso) · 2023-12-07T18:36:01.761Z · LW(p) · GW(p)
Thanks. So what do you think is the core of the problem? The LLM not recognizing that a user given instruction is trying to modify the system prompt and proceeds out of its bounds?