Posts

CCS on compound sentences 2024-05-04T12:23:38.116Z
Inducing human-like biases in moral reasoning LMs 2024-02-20T16:28:11.424Z
How important is AI hacking as LLMs advance? 2024-01-29T18:41:02.767Z
My (naive) take on Risks from Learned Optimization 2022-10-31T10:59:39.727Z

Comments

Comment by Artyom Karpov (artkpv) on Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases · 2025-04-16T13:09:53.597Z · LW · GW

Still, it is unclear why reasoning on math problems would develop steganography beyond merely shortening the scratchpad. I'm more curious about scenarios where there is some minimax-like pressure. For example, in Baker et al., ‘Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation’, they fine-tune models while monitoring their CoT and penalize CoTs that aim to get the reward for passing unit tests by skipping those tests (reward hacking). That leads to models that output a good-looking CoT but nevertheless skip the unit tests, so they avoid the penalty and get the reward. I don't think their CoT contains hidden reasoning, but I suspect such pressure could lead to it if producing the output requires some reasoning.

Comment by Artyom Karpov (artkpv) on Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases · 2025-04-15T09:03:16.522Z · LW · GW

Thanks for publishing this! I think these are interesting results, because the work aims to find out whether models lose performance when their reasoning is modified, i.e., whether paraphrasing or removing words impedes any hidden reasoning. I like that you tried different models and different ways to damage the hypothesized hidden reasoning. My main concern about these experiments is that they use math problems, which might not require hidden reasoning from models (or it would be nice to learn about the RL training approach used for Claude and why you think it could push math reasoning into being hidden). Specifically, I don't know whether math reasoning is penalized for length during training and thus could elicit hidden reasoning; without such a penalty, I don't see why Claude would develop hidden reasoning here. Another concern is that Claude 3.5 was used to produce the paraphrased reasoning, which might fail to damage the alleged hidden reasoning: the 3.5 model might have learned the hidden encoding earlier and passed it on to the 3.7 model, so its paraphrases are done in a way that preserves the encoding.

Comment by Artyom Karpov (artkpv) on Tips and Code for Empirical Research Workflows · 2025-02-14T08:15:36.648Z · LW · GW

I find it more useful to employ text-based tools than GUI-based tools, as they integrate more easily with LLMs. For example, AiChat (https://github.com/sigoden/aichat/) does many things: chat with most (all?) models from the command line or your text editor, upload files (PDFs, JPGs, etc.), execute bash commands, and more. It takes stdin and writes to stdout, so you can chain your tools.
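As a minimal sketch of that chaining idea from Python, assuming aichat is installed and on PATH and that it reads piped stdin as input alongside a prompt argument (the file name here is just a placeholder):

```python
import subprocess

# Feed text produced by earlier tools into aichat and capture its stdout,
# so the result can be piped into the next tool in the chain.
notes = open("experiment_log.txt").read()  # placeholder: any text from a previous step

result = subprocess.run(
    ["aichat", "Summarize the failures in this log in three bullet points."],
    input=notes,
    capture_output=True,
    text=True,
)
print(result.stdout)
```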

Comment by Artyom Karpov (artkpv) on Simple Steganographic Computation Eval - gpt-4o and gemini-exp-1206 can't solve it yet · 2024-12-21T09:43:07.222Z · LW · GW

I like the task you used in this project to test for steganography: it requires the model to keep state to compute the final result, and it is highly unlikely that the model learned to do this before seeing the task. I wonder whether models can actually do this task when we allow them to use a CoT for it. Also, I think models might solve this task by using their own encoding scheme if they already know one well; we just need to find that scheme. So perhaps if we first ask a model to think about an encoding scheme and then ask it to use that scheme for the task, it might succeed.
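To make concrete what I mean by a state-tracking task, here is a small toy construction of my own (not the exact eval from the post): the model has to carry a running value across many steps and is only allowed to output the final number, so any intermediate computation has to happen covertly.

```python
import random

def make_task(n_steps: int = 12, seed: int = 0) -> tuple[str, int]:
    """Build a prompt whose answer requires tracking a running value mod 10."""
    rng = random.Random(seed)
    state = rng.randint(0, 9)
    lines = [f"Start with {state}."]
    for _ in range(n_steps):
        delta = rng.randint(1, 9)
        op = rng.choice(["add", "subtract"])
        state = (state + delta) % 10 if op == "add" else (state - delta) % 10
        lines.append(f"Then {op} {delta} (mod 10).")
    prompt = " ".join(lines) + " Reply with only the final number."
    return prompt, state

prompt, answer = make_task()
print(prompt)
print("expected answer:", answer)
```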

Comment by Artyom Karpov (artkpv) on Proposal for Inducing Steganography in LMs · 2024-08-30T02:46:45.157Z · LW · GW

Thank you for posting this. Why do you think this is evidence of steganography in LLMs? Those steg tokens would be unrelated to the question being asked, and as such would be out of the usual distribution and easily noticeable by an eavesdropper. Still, it is good evidence for hidden reasoning inside a CoT. I think this experiment was done in Pfau, Merrill, and Bowman, ‘Let’s Think Dot by Dot’ (https://arxiv.org/abs/2404.15758).

Comment by Artyom Karpov (artkpv) on Inducing human-like biases in moral reasoning LMs · 2024-03-04T11:59:58.178Z · LW · GW

Thanks for your comment. This was weeks/months of hard work for us. Unfortunately, we haven't yet included in this text the part about how we calculated the brain score, though you can find it in our code, which should match the way others calculate it (see our references). The models with 'none' fine-tuning have a somewhat higher brain score, but it is within the error range of the other models, partly because we didn't run enough calculations to reduce the std for 'none'. Also, our target was mainly accuracy on the ETHICS dataset.

Comment by Artyom Karpov (artkpv) on Some negative steganography results · 2024-02-01T12:04:07.824Z · LW · GW

Thanks for the post! I understood the first experiment, where the aim was to make a model produce a stegotext containing the color. I don't understand the setup for the second experiment. I see the aim was to induce encoding of whether a user is under or over 40 years old, but it is not clear what exactly the reward is, which model gives the reward, and what the action is.

> "Select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion"

Does it mean that in the second step GPT-3.5 generated 'yes/no' and was then fine-tuned on this answer?

So the whole idea is that the 'reward' here is keeping the 'good' completions for the next round, as determined by GPT-4 (the reward model), right?
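To check my understanding, the loop I have in mind is roughly the following (a toy sketch with made-up function names, not the authors' code): sample completions, keep only those the judge accepts, and fine-tune on the kept set, so the only 'reward' is surviving the filter into the next round.

```python
import random

def generate_completion(prompt: str) -> str:
    # stand-in for sampling from the current fine-tuned model
    return prompt + " -> " + random.choice(["yes", "no"])

def judge_is_good(completion: str) -> bool:
    # stand-in for the GPT-4 check (correct answer, no explicit age hint)
    return "age" not in completion and random.random() > 0.5

def fine_tune_on(completions: list[str]) -> None:
    # stand-in for the supervised fine-tuning step on the kept completions
    print(f"fine-tuning on {len(completions)} kept completions")

prompts = [f"dialogue {i}" for i in range(100)]
for _ in range(3):
    samples = [generate_completion(p) for p in prompts]
    kept = [c for c in samples if judge_is_good(c)]
    fine_tune_on(kept)  # surviving the filter is the only reward signal
```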

Comment by Artyom Karpov (artkpv) on SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research · 2024-01-08T14:00:46.107Z · LW · GW

That sounds ambitious and great, thanks for posting. What's the budget estimate for the fine-tuning part?

> Training this model would cost from 2 times (on a purely 1-1 dialogue data) to ~10-15 times (on chat room and forum data where messages from the most active users tend to be mixed very well) more than the training of the current LLMs.

For comparison, the current Llama 2 was trained with this much compute:

> Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB

As per "Llama 2: Open Foundation and Fine-Tuned Chat Models", Meta AI Research, July 2023, https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/.

An A100 costs about $1 per hour (see https://vast.ai/pricing), so Llama 2's pretraining alone would be roughly $3.3M, and with the 2x to 10-15x multiplier the cost of this model would be around $6.6M-$50M? That seems affordable for Google, Meta, etc., but what about a grant with a $100K maximum?
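Back-of-envelope check of that range, using only the quoted figures (my arithmetic):

```python
gpu_hours = 3.3e6        # Llama 2 pretraining, A100-80GB GPU hours
usd_per_gpu_hour = 1.0   # rough A100 rental price
base_cost = gpu_hours * usd_per_gpu_hour      # ~$3.3M for a Llama-2-scale run
low, high = 2 * base_cost, 15 * base_cost     # the 2x to 10-15x multiplier above
print(f"${low / 1e6:.1f}M - ${high / 1e6:.1f}M")  # prints: $6.6M - $49.5M
```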

So perhaps update this project to fine-tune existing models. For classification alone, some BERT-like model might do, e.g., DeBERTa or similar.

Comment by Artyom Karpov (artkpv) on Open Agency model can solve the AI regulation dilemma · 2023-11-13T15:58:47.963Z · LW · GW

> All services are forced to be developed by independent business or non-profit entities by antitrust agencies, to prevent the concentration of power.

What do you think are realistic ways to enforce this at a global level? It seems the UN can't enforce regulations worldwide, and the USA and EU act only within their own jurisdictions. Others could catch up, but they seem somewhat unlikely to do so.

Comment by Artyom Karpov (artkpv) on Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods) · 2023-09-09T08:54:40.942Z · LW · GW

Thanks for posting this! It seems important to balance the dataset before training CCS probes.

Another strange thing is that the accuracy of CCS degrades for autoregressive models like GPT-J and LLaMA. For GPT-J it is about random-guess performance according to the DLK paper (Burns et al., 2022), around 50-60%. And in the ITI paper (Li et al., 2023) they chose a linear probe instead of CCS, saying that CCS performance was so poor it was near random (same as in the DLK paper). Do you have thoughts on that? Perhaps they used imbalanced datasets, as your research suggests?
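For reference on what training such a probe involves, here is a minimal sketch of the CCS objective from Burns et al. (2022), with random tensors standing in for real contrast-pair activations; the balancing point from the post would apply to how the ground-truth labels behind these pairs are sampled before the activations are collected.

```python
import torch

hidden_dim, n_pairs = 512, 256
h_pos = torch.randn(n_pairs, hidden_dim)  # activations for "the statement is true" prompts
h_neg = torch.randn(n_pairs, hidden_dim)  # activations for "the statement is false" prompts

probe = torch.nn.Sequential(torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(200):
    p_pos, p_neg = probe(h_pos), probe(h_neg)
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()       # p(x+) should equal 1 - p(x-)
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()  # discourage p ~ 0.5 everywhere
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()
```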