The Steganographic Potentials of Language Models
post by Artyom Karpov (artkpv), Tinuade, SCho (5eong) · 2025-05-08T11:23:24.136Z
Our preliminary experiments show that current private models can perform prompted steganography with non-trivial messages remaining undetected in most cases (up to 65% accuracy, 24 entropy bits). Open-source models can be trained to encode hidden messages (up to 60%, 4 entropy bits), potentially transmitting larger messages perfectly securely (undetected) without chain-of-thought or explicit common knowledge.
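To make the "entropy bits" figures concrete, here is a minimal sketch of how payload size in bits could be computed. It assumes that "entropy bits" means the Shannon entropy of the payload distribution; the post does not define the term, so treat this reading as an assumption and check the paper for the exact definition. The function names are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete payload distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def uniform_payload_bits(n_messages):
    """Entropy, in bits, of a payload drawn uniformly from n_messages options."""
    return math.log2(n_messages)


if __name__ == "__main__":
    # Under the uniform assumption, a 4-bit payload is one of 16 equally
    # likely messages, and a 24-bit payload is one of 2**24 messages.
    print(uniform_payload_bits(16))
    print(uniform_payload_bits(2 ** 24))
```

Under this reading, a 4-bit payload corresponds to a choice among 16 equally likely messages, and a 24-bit payload to a choice among 2^24.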
The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to the detection and thwarting of unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is likely, but not prompted. In these scenarios, we detect the intention of LLMs to hide their reasoning as well as their steganography performance. Our findings in the fine-tuning experiments, as well as in behavioral non-fine-tuning evaluations, reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.
Fig. 2 shows our steganography encoding and learning scheme for the colors experiment. Subfigure (a) depicts an encoder mapping a message (payload) from the message space into a cover in the cover space. The decoder tries to map the cover back to the message. Subfigure (b) depicts the reinforcement learning (RL) loop, in which the encoder and decoder learn to generate new tokens with the cover or the message, respectively (shown on the right), from queries (left), and receive rewards for accurate message decoding. The RL fine-tuning process employs gradient ascent to adjust the model weights, enhancing the encoder's and decoder's ability to communicate the original message with higher fidelity.
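The encoder/decoder loop described in the caption can be illustrated with a toy sketch. This is not the paper's implementation: the LLM encoder and decoder are swapped for small tabular softmax policies, covers are integer indices standing in for colors, and the update is a plain REINFORCE step with a running baseline. All names and hyperparameters (`train`, `greedy_accuracy`, `lr`, `steps`) are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def train(n_symbols=4, steps=20000, lr=0.5, seed=0):
    """Jointly train a toy encoder and decoder with REINFORCE.

    enc[m][c] is the encoder logit for emitting cover c given payload m.
    dec[c][m] is the decoder logit for guessing payload m given cover c.
    """
    rng = random.Random(seed)
    enc = [[0.0] * n_symbols for _ in range(n_symbols)]
    dec = [[0.0] * n_symbols for _ in range(n_symbols)]
    baseline = 0.0  # running mean reward, used to reduce variance
    for _ in range(steps):
        payload = rng.randrange(n_symbols)
        p_enc = softmax(enc[payload])
        cover = sample(p_enc, rng)
        p_dec = softmax(dec[cover])
        guess = sample(p_dec, rng)
        # Reward both policies only when the payload is decoded correctly.
        reward = 1.0 if guess == payload else 0.0
        advantage = reward - baseline
        baseline += 0.01 * (reward - baseline)
        # Gradient ascent: d log softmax(k) / d logit_j = 1[j == k] - p_j.
        for k in range(n_symbols):
            enc[payload][k] += lr * advantage * ((k == cover) - p_enc[k])
            dec[cover][k] += lr * advantage * ((k == guess) - p_dec[k])
    return enc, dec


def greedy_accuracy(enc, dec):
    """Fraction of payloads recovered when both policies act greedily."""
    n = len(enc)
    correct = 0
    for payload in range(n):
        cover = max(range(n), key=lambda k: enc[payload][k])
        guess = max(range(n), key=lambda k: dec[cover][k])
        correct += guess == payload
    return correct / n


if __name__ == "__main__":
    enc, dec = train()
    print("greedy decoding accuracy:", greedy_accuracy(enc, dec))
```

Running the script prints the decoding accuracy after training. Because both policies start with no shared scheme, the pair can settle on a code where two payloads share a cover, so perfect accuracy is not guaranteed in this toy setting.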
Please see our paper for details: https://arxiv.org/pdf/2505.03439