Comments

Comment by BestJohn on The Waluigi Effect (mega-post) · 2023-03-06T14:02:32.845Z · LW · GW

What does trust mean, from the perspective of the LLM algorithm, in terms of a flattery-component? Do LLMs have a 'trustometer'? Or can they evaluate some sort of stored world-state, compare the prompt against it, and come up with a "veracity" value that they use when responding to the prompt?