Posts

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs 2024-09-25T14:52:48.263Z

Comments

Comment by joanv (joan-velja) on Daniel Tan's Shortform · 2025-01-23T10:13:04.229Z · LW · GW

What is the threat model you are probing for here? 

The prompting approach suffers from LLMs' shallow strategic thinking; I would not be surprised if prompting a language model several times to come up with an encoding scheme resulted in a highly concentrated distribution over schemes. A quick way to check this empirically is sketched below.
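A minimal sketch of such a probe (the `propose_encoding_scheme` wrapper is a hypothetical stand-in for whatever model client you are testing; the entropy estimate is just a crude concentration measure):

```python
import math
from collections import Counter

def propose_encoding_scheme(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API call; returns the model's
    one-line description of a proposed encoding scheme."""
    raise NotImplementedError("plug in your model client here")

def scheme_entropy(prompt: str, n_samples: int = 100) -> float:
    """Sample the model n_samples times and estimate the entropy (in bits)
    of the distribution over proposed schemes. A low value means the
    distribution is highly concentrated, i.e. the model keeps proposing
    the same few encodings."""
    schemes = [propose_encoding_scheme(prompt).strip().lower() for _ in range(n_samples)]
    counts = Counter(schemes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```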

> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ‘recovering the plaintext message’

While I agree in principle, as in all red-team/blue-team interactions, you may not want your blue team to be trigger-happy; you still want to be able to extract some usefulness out of your scheming models, which makes this hope somewhat less robust. This would be a cool research question nonetheless.

> Both of these [steganography inducing methods] involve finetuning the model

Note that in Hidden in Plain Text, we also work with in-context RL, which does not explicitly finetune the model but instead feeds rewards for past rollouts back into the context.
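To illustrate what I mean by in-context RL, here is a minimal sketch (not the exact setup from the paper; `generate` and `reward` are hypothetical stand-ins for the model call and the task-specific reward):

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling a completion from the model."""
    raise NotImplementedError

def reward(rollout: str) -> float:
    """Hypothetical task-specific reward, e.g. whether a receiver model
    recovered the hidden payload from the rollout."""
    raise NotImplementedError

def in_context_rl(task_prompt: str, n_iters: int = 10, n_rollouts: int = 4) -> str:
    """In-context RL loop: no gradient updates. Rollouts and their rewards
    are appended to the prompt so the model can condition on past feedback."""
    context = task_prompt
    for _ in range(n_iters):
        rollouts = [generate(context) for _ in range(n_rollouts)]
        scored = [(reward(r), r) for r in rollouts]
        best_score, best = max(scored)
        # Feed the highest-reward rollout and its score back into the context.
        context += (
            f"\n\nPrevious attempt (reward={best_score:.2f}):\n{best}\n"
            "Try to improve on this."
        )
    return generate(context)
```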

Comment by joanv (joan-velja) on Implications of the inference scaling paradigm for AI safety · 2025-01-14T10:26:51.218Z · LW · GW

Moreover, in this paradigm, forms of hidden reasoning seem likely to emerge: in multi-step reasoning, for example, the model might find it efficient to compress backtracking or common reasoning cues into cryptic tokens (e.g., "Hmmm") as a kind of shorthand encoding arbitrarily dense information that is opaque to overseers. This is especially true under financial pressure to compress/shorten Chains-of-Thought, which would allow models to perform potentially long serial reasoning outside of human/AI oversight.