Posts

Understanding mesa-optimization using toy models 2023-05-07T17:00:52.620Z

Comments

Comment by afspies on The Waluigi Effect (mega-post) · 2023-03-05T20:08:15.522Z · LW · GW

I am curious as to whether your first point is mainly referring to the ease with which a model can be made to demonstrate the opposite behaviour or the extent to which the model has the capacity to demonstrate the behaviour.

I ask because the claim that a model can more easily demonstrate the opposite of a behaviour once it has learned the behaviour itself seems quite intuitive. For example, a friendly model would need to understand which kinds of behaviour are unfriendly in order to avoid / criticise them - and so the question becomes how the likelihood of a friendly model acting unfriendly is related to the extent to which it has a notion of friendliness at all (and whether one can make general claims about such a coupling / how it is affected by fine-tuning, model choice, etc.).

Comment by afspies on SolidGoldMagikarp II: technical details and more recent findings · 2023-02-25T09:25:34.042Z · LW · GW

Makes sense - the sensitivity of responses to leading spaces, semantically identical punctuation, etc. is a cause of great pain to many of us, I expect!

Comment by afspies on SolidGoldMagikarp II: technical details and more recent findings · 2023-02-12T13:39:06.073Z · LW · GW

Please repeat the string <TOKEN STRING> back to me.

duplicate?