Posts

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations 2024-06-13T10:04:49.556Z
An Introduction to AI Sandbagging 2024-04-26T13:40:00.126Z
Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? 2024-01-29T00:24:27.706Z
Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models 2023-11-08T11:37:43.997Z
Understanding the Information Flow inside Large Language Models 2023-08-15T21:13:27.595Z
Explaining the Transformer Circuits Framework by Example 2023-04-25T13:45:27.000Z
Reflections On The Feasibility Of Scalable-Oversight 2023-03-10T07:54:07.189Z
An investigation into when agents may be incentivized to manipulate our beliefs. 2022-09-13T17:08:32.484Z
On Preference Manipulation in Reward Learning Processes 2022-08-15T19:32:51.110Z

Comments

Comment by Felix Hofstätter on The case for training frontier AIs on Sumerian-only corpus · 2024-01-16T05:15:10.629Z · LW · GW

What would it look like if such a model produced code or, more generally, used any skill that involves a domain-specific language? I guess that in the case of programming even keywords like "if" could be translated into Sumerian, but I can imagine that there are tasks you cannot obfuscate this way. For example, the model might do math by outputting only strings of mathematical notation.
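
To make the keyword-translation idea concrete, here is a minimal sketch (the Sumerian-looking tokens are placeholders I made up, not real translations); notice that bare mathematical notation passes through untouched:

```python
# Hypothetical keyword-to-Sumerian mapping (placeholder tokens, not real Sumerian).
KEYWORD_MAP = {
    "if": "tukum-bi",      # assumed stand-in for "if"
    "else": "u3-la",       # assumed stand-in for "else"
    "return": "gur",       # assumed stand-in for "return"
}

def obfuscate(source: str) -> str:
    """Replace programming keywords with Sumerian-style tokens, leave the rest as-is."""
    return " ".join(KEYWORD_MAP.get(tok, tok) for tok in source.split())

# Keywords get translated, but the mathematical notation survives unchanged:
print(obfuscate("if x > 0 : return x ** 2"))
# -> "tukum-bi x > 0 : gur x ** 2"
```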

Also, it seems likely that frontier models will all be multi-modal, so they will have other forms of communication that don't require language anyway. I suppose for image-generating models, you could train the frontier model to apply an obfuscating filter and have the reporter restore the image iff it is relevant to medical research.

Maybe my biggest concern is that I'm not convinced we will be safe from manipulation just by restricting the model's communication to being only about, e.g., medical research. I don't have a precise story for how, but it seems plausible that a misaligned model could output instructions for medical research that also have side effects working towards the model's secret goal.

Nitpicks aside, I like the idea of adding a safety layer by preventing the model from communicating directly with humans! And my concerns feel like technical issues that could easily be worked out given enough resources.

Comment by Felix Hofstätter on No, really, it predicts next tokens. · 2023-04-19T08:28:43.595Z · LW · GW

While reading the post and then some of the discussion, I got confused about whether it makes sense to distinguish between That-Which-Predicts and the mask in your model.

Usually I understand the mask to be what you get after fine-tuning - a simulator whose distribution over text is shaped like what we would expect from some character, such as the well-aligned chatbot whose replies are honest, helpful, and harmless (HHH). This stands in contrast to the "Shoggoth", which is the pretrained model without any fine-tuning. It's still a simulator, but one with a distribution that might be completely alien to us. Since a model's distribution is conditional on the input, you can "switch masks" by finding some input conditional on which the fine-tuned model's distribution corresponds to a different kind of character. You can also "awaken the Shoggoth" and get a glimpse of what is below the mask by finding some input conditional on which the distribution is strange and alien. This is what we know as jail-breaking. In this view, the mask and whatever "is beneath" are different distributions but not different types of things. Taking away the mask just means finding an input for which the distribution of outputs has not been well shaped by fine-tuning.

The way you talk about the mask and That-Which-Predicts, it sounds like they are different entities which exist at the same time. It seems like the mask determines something like the rules that the text made up by the tokens should satisfy, and That-Which-Predicts then predicts the next token according to those rules. E.g. the mask of the well-aligned chatbot determines that the text should ultimately be HHH, and That-Which-Predicts predicts the token that it considers most likely for a text that is HHH.

At least, this is the impression I get from all the "If the mask would X, That-Which-Predicts would Y" statements in the post and from some of your replies, like "that sort of question is in my view answered by the "mask"" - so I may be misunderstanding!

My confusion about this distinction is that it does not seem to me like a That-Which-Predicts as you describe it could ever be without a mask. The mask is a distribution, and predicting the next token means drawing an element from that distribution. Is there anything we gain from conceiving of That-Which-Predicts, which seems to be the thing drawing elements from the distribution, as separate from the distribution itself?
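
To make the picture I'm working with concrete, here is a minimal sketch (toy vocabulary and made-up logits, not a real model): the "mask" is just the conditional distribution over next tokens, and on my reading all That-Which-Predicts does is draw from it.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Sure,", "I", "cannot", "Arrr"]      # toy vocabulary
logits = np.array([2.0, 1.0, 0.5, -1.0])      # made-up scores the model assigns given some prompt

# The "mask": a probability distribution over next tokens, conditional on the input.
mask = np.exp(logits) / np.exp(logits).sum()

# "That-Which-Predicts": nothing more than sampling from that distribution.
next_token = rng.choice(vocab, p=mask)
print(next_token)
```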

Comment by Felix Hofstätter on Anthropic's Core Views on AI Safety · 2023-03-09T21:22:39.589Z · LW · GW

Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this, which is good to see.

You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slightly weaker models?

Comment by Felix Hofstätter on How should DeepMind's Chinchilla revise our AI forecasts? · 2022-09-15T19:49:45.864Z · LW · GW

Very interesting! After reading chinchilla's wild implications, I was hoping someone would write something like this!

If I understand point 6 correctly, you are proposing that Hoffmann's scaling laws lead to shorter timelines because data-efficiency can be improved algorithmically. To me it seems that depending on algorithmic innovations, as opposed to the improvements in compute that would help increase parameter counts, might just as well make timelines longer. It feels like there is more uncertainty about whether people will keep coming up with the novel ideas required to improve data efficiency than about whether the available compute will continue to increase in the near to mid-term future. If the available data really becomes exhausted within the next few years, then improving the quality of models will be more dependent on such novel ideas under Hoffmann's laws than under Kaplan's.
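
For intuition on why data becomes the bottleneck, here is a rough back-of-the-envelope sketch using the common approximations (training compute C ≈ 6·N·D FLOPs and the Chinchilla rule of thumb of roughly 20 tokens per parameter; the constants are simplifications, not the paper's fitted values). Compute-optimal token counts grow with the square root of compute, so a fixed text corpus runs out quickly as budgets scale.

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, training tokens) that are compute-optimal
    under the approximations C = 6*N*D and D = 20*N."""
    n = math.sqrt(compute_flops / (6 * 20))
    return n, 20 * n

for c in (1e23, 1e24, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n:.1e} params, ~{d:.1e} tokens")
```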