Posts

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs 2024-09-25T14:52:48.263Z

Comments

Comment by joanv (joan-velja) on Daniel Tan's Shortform · 2025-01-23T10:13:04.229Z · LW · GW

What is the threat model you are probing for here? 

The prompting approach suffers from LLMs' shallow strategic thinking; I would not be surprised if prompting a language model several times to come up with an encoding scheme resulted in a highly concentrated distribution over schemes. A quick way to check this empirically is sketched below.
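A minimal sketch of such a probe (the `propose_encoding_scheme` wrapper is a hypothetical stand-in for whatever model client you are testing; the entropy estimate is just a crude concentration measure):

```python
import math
from collections import Counter

def propose_encoding_scheme(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API call; returns the model's
    one-line description of a proposed encoding scheme."""
    raise NotImplementedError("plug in your model client here")

def scheme_entropy(prompt: str, n_samples: int = 100) -> float:
    """Sample the model n_samples times and estimate the entropy (in bits)
    of the distribution over proposed schemes. A low value means the
    distribution is highly concentrated, i.e. the model keeps proposing
    the same few encodings."""
    schemes = [propose_encoding_scheme(prompt).strip().lower() for _ in range(n_samples)]
    counts = Counter(schemes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```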

> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ‘recovering the plaintext message’

While I agree in principle, as in all red-team/blue-team interactions, you may not want your blue team to be trigger-happy; you still want to be able to extract some usefulness out of your scheming models, which makes this hope somewhat less robust. This would be a cool research question nonetheless.

> Both of these [steganography inducing methods] involve finetuning the model

Note that in Hidden in Plain Text, we also work with in-context RL, which does not explicitly finetune the model but instead feeds rewards for past rollouts back into the context.
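To illustrate what I mean by in-context RL, here is a minimal sketch (not the exact setup from the paper; `generate` and `reward` are hypothetical stand-ins for the model call and the task-specific reward):

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling a completion from the model."""
    raise NotImplementedError

def reward(rollout: str) -> float:
    """Hypothetical task-specific reward, e.g. whether a receiver model
    recovered the hidden payload from the rollout."""
    raise NotImplementedError

def in_context_rl(task_prompt: str, n_iters: int = 10, n_rollouts: int = 4) -> str:
    """In-context RL loop: no gradient updates. Rollouts and their rewards
    are appended to the prompt so the model can condition on past feedback."""
    context = task_prompt
    for _ in range(n_iters):
        rollouts = [generate(context) for _ in range(n_rollouts)]
        scored = [(reward(r), r) for r in rollouts]
        best_score, best = max(scored)
        # Feed the highest-reward rollout and its score back into the context.
        context += (
            f"\n\nPrevious attempt (reward={best_score:.2f}):\n{best}\n"
            "Try to improve on this."
        )
    return generate(context)
```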

Comment by joanv (joan-velja) on Implications of the inference scaling paradigm for AI safety · 2025-01-14T10:26:51.218Z · LW · GW

Moreover, in this paradigm, forms of hidden reasoning seem likely to emerge: in multi-step reasoning, for example, the model might find it efficient to compress backtracking or common reasoning cues into cryptic tokens (e.g., "Hmmm") as a kind of shorthand encoding arbitrarily dense information that is opaque to overseers. This is especially true under financial pressure to compress/shorten Chains-of-Thought, which would allow models to perform potentially long serial reasoning outside of human/AI oversight.