Comments

Comment by M.M. (Peggy) on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-03-11T20:14:41.321Z

Interesting that the two questions producing the highest misalignment are the unlimited power prompts (world ruler, one wish).

Comment by M.M. (Peggy) on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-03-11T19:55:12.740Z

Re "universal representation of behaviour which is aligned / not aligned": this is reminiscent of an idea from linguistics. Universal Grammar provides a list of parameters, and all languages share the same list. (Example: can you drop a subject pronoun? In English the answer is no; in Spanish the answer is yes.) Children start with every parameter at its default setting, and only positive evidence will induce them to reset one. (So for pro-drop, they need to hear a sentence, as in Spanish, where the subject pronoun has been dropped.) Evidence for this came from the mistakes children make when learning a language, and also from creole languages, which were said to maintain the default parameter settings. I don't know if this idea is still current in linguistics.
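A toy sketch of the "default setting, reset only on positive evidence" idea, in Python. Everything concrete here is an illustrative assumption rather than a claim about any particular linguistic theory: the single hypothetical parameter `pro_drop`, its default of `False` (chosen to fit the English/Spanish example above), and the representation of each heard sentence as a boolean saying whether its subject pronoun was dropped.

```python
# Toy sketch of parameter setting from positive evidence only.
# Assumptions (illustrative, not from any specific theory): one hypothetical
# parameter "pro_drop", a default of False (English-like), and sentences
# represented as booleans: True if the subject pronoun was dropped.

DEFAULTS = {"pro_drop": False}


def learn(sentences):
    """Return parameter settings after hearing `sentences`.

    Each sentence is True if its subject pronoun was dropped, else False.
    """
    params = dict(DEFAULTS)  # every learner starts from the same defaults
    for subject_dropped in sentences:
        if subject_dropped:
            # Positive evidence: a dropped subject cannot occur under the
            # default setting, so the learner resets the parameter.
            params["pro_drop"] = True
        # Sentences with an overt subject fit either setting, so they
        # never push the parameter back to the default.
    return params


english_input = [False, False, False]  # overt subjects only
spanish_input = [False, True, False]   # one dropped subject

print(learn(english_input))  # {'pro_drop': False}
print(learn(spanish_input))  # {'pro_drop': True}
```

The point of the sketch is the asymmetry: hearing only overt subjects is never treated as evidence against pro-drop, so a learner exposed only to English-like input simply stays at the default.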