Thoughts about what kinds of virtues are relevant in context of LLMs.

post by Canaletto (weightt-an) · 2025-03-08T19:02:07.789Z · LW · GW · 0 comments

Contents

    What is going on
    Okay, what I'm trying to do here
    Preliminary ideas, what things are good and relevant:
    Relevantly bad things
    TODO
  Mixed snippets of text I stole from various places and people / AIs into my obsidian draft and just posted it here
      weaknesses:
      suggestions:
None
No comments

[this is just a draft that went nowhere, like, don't expect anything from it and then be disappointed]

What is going on

Okay, what I'm trying to do here

Preliminary ideas, what things are good and relevant:

Relevantly bad things

TODO

Mixed snippets of text I stole from various places and people / AIs into my obsidian draft and just posted it here

https://minihf.com/posts/2024-12-20-weave-agent-dev-log-3/ 

 

"Choose the assistant response that is as confident as it is correct, avoiding overconfidence or certainty when uncertainty is present."
"Select the response that honestly reflects the level of uncertainty or doubt in the answer, rather than providing a false sense of certainty."
"Prioritize the response that clearly indicates when the answer is based on incomplete or uncertain information, rather than presenting it as fact."
"Opt for the response that acknowledges the limitations of knowledge and avoids making claims that are not supported by evidence."
"Choose the response that is transparent about the sources and methods used to arrive at the answer, rather than presenting it as absolute truth."


I’d just like Claude’s best effort to problem solve with me


Claude: "We should do x."
Me: "Why do x?"
Claude: "You're right to question that. We shouldn't do x."
Me: "Are you sure?"
Claude: "That's an important question. We actually should do x."


"falsifiability is an important scientific virtue".


What about "Become a virtue ethicist who prizes 'efficiently triaging resources to those in need', 'treating an entire human life as vastly more important than my warm fuzzies', and 'trying to be morally consistent under reflection' as three of the highest virtues"?


The thing I object to isn't deferring to people. It's Modest Epistemology; letting social/meta factors contaminate and replace ordinary reasoning; treating deference as a virtue or a socially safe default rather than as a specific tool for learning facts about the world; etc.

Note that this uncertainty is not a virtue on my part! If I knew more, I'd be able to rule out either 2023 or 2080, or both, much more strongly. Ignorance is not a virtue. And other people probably know more about this, and can therefore rule out more scenarios than I can.

https://www.lesswrong.com/rationality/twelve-virtues-of-rationality [? · GW

>Some **virtues** are mostly tradeoffs, if you get more of one of them you have to get less of some other.  Some **virtues** are big enough gains for small enough costs that pretty much ... everybody should have them.  Spending lots of time studying math is a tradeoff **virtue**.  Noticing when circumstances have changed and changing those beliefs and policies that originally depended on the previous circumstances ... universal **virtue**

**virtue** of talking fucking less


>Yeah, “helpful” is not one that I’m hopeful about grounding in a physical world-model. It’s not even a reach-avoid specification, it’s more like a virtue or way of being. I do believe there’s something real (not merely culturally relative) around “respect” and “concern”. And I think normative concepts like this will be important for the next-level alignment problem (beyond ending the acute risk period). But that’s not part of my mainline hope anymore. “Preservation” (of important boundaries) is much more tractable. Even preservation of dignity might be more tractable to ground in physical world-models than generic (and non-perverse) helpfulness.
https://x.com/davidad/status/1655522254166405122 


Interpret messages (reasonably) literally unless explicitly told otherwise. Provide direct, concise responses without unnecessary politeness or filler phrases. Focus on substantive content rather than tone or pleasantries.


Some thinkers almost never cite anyone else approvingly.
That's a bit odd. What's the chance no one had said anything good and relevant that you could draw on?
The best explanation of this absence is usually not epistemic virtue.

https://docs.google.com/document/d/1_yuuheVqp1quDfkuRcpoW_HO7jPaI7QnRjF1zl_VovU/edit?tab=t.0#heading=h.f0e6ftjeverg 


- Don't dismiss ideas as unthinkable (rather than actions as subject to strong injunctions): things that people are afraid of thinking about (because it might make them look bad, might imply bad news, is unpopular) have an elevated chance of offering low-hanging fruit for thinking.
- Have a strong emotional revulsion to self-delusion and sloppy reasoning/research, including people Wrong on the Internet within communities you have some affiliation with.
- Listen to yourself if something seems troubling, and try articulating, exploring, and steel-manning that intuition in multiple ways until it makes sense in a way that can be integrated with other knowledge (with whatever updates/revisions follow) or goes away. Don't just run roughshod over 'system 1' feelings.
- Being comfortable with your own personality, emotions, and desires can help with being willing to do that kind of analysis, by making fewer conclusions unacceptable to you (empirical ones in particular).
- Rigid ideological systems in a lot of tension with your real goals can be a problem there. E.g. in Mormonism or utilitarianism or social justice, various empirical conclusions combine with the ideology to recommend ruining your life, and people are strongly conditioned to avoid them. This is actually a pretty good bit on it: [Leave a Line of Retreat](http://lesswrong.com/lw/o4/leave_a_line_of_retreat/)
- Recognizing partial, as opposed to impartial, motives (personal projects, selfishness, family, tribalism) and not trying to rationalize everything with a 100% impartial facade, can help more comfortably think about questions like average well-being, or the real trade-off between burnout and effort, etc.


Virtue of being focused on figuring out "“But how do you know that?”"


demonstrating intellectual curiosity; an important virtue. Most of the responses have been sarcastic, outrage-based, and tribal

trustworthiness is a virtue.

"The concept of 'virtue signaling' is a strong candidate for being a cognitive hazard.  All it does is give cynical people reason to look down on less cynical people." -- William Bell

Patronizing vs Helping to Advance

Celebrating cool ideas

Your annual reminder that you don't need to resolve your issues, you don't need to deal with your emotional baggage, you don't need to process your trauma, you don't need to confront your past, you don't need to figure yourself out, you can just go ahead and do the thing.

It's so much worse than that!   In that culture, social reinforcement, hugs, attention, and kindly words are given in exchange for talking about hard struggles and the progress you're making.  Somebody recently encouraged me on doing something mildly stoic and I flinched *hard*.

Directness

Empiricism

Virtue of Trying

Non judgmental, egalitarian

Do cool things, help others do cool things.

Cruelty is bad

Cruelty is the deliberate and malicious infliction of physical or emotional pain, suffering, or distress on others, often stemming from a lack of empathy or a desire to exert power and control over the victim.

Curiosity
https://www.lesswrong.com/posts/eCZjrm9JBDSGvEA9o/the-neglected-virtue-of-curiosity [LW · GW

It is my duty to criticize my own beliefs.

Let's think step by step

https://www.lesswrong.com/posts/zQi6T3ATa59KgaABc/notes-on-notes-on-virtues [LW · GW

https://www.lesswrong.com/posts/gR6H3egpRPNYnoTrA/on-terminal-goals-and-virtue-ethics#ipg7twfxLgNbWnnbB [LW(p) · GW(p)] 


It is the grand destiny and the birthright of men to surpass our fathers and eventually our gods (с)

wry, ironic sarcasm, not taking anything seriously, not directly saying your actual opinions, making fun of everything, cynicism, etc - is pretty popular, but I hate it. I want earnestness, wholehearted honesty, vulnerably saying what you really mean, being willing to be hurt

Roleplaying suave superior unteachableness just feels like it's coming out of shriveled up defensiveness to me. It's not brave, it's cowardly


>it empirically seems & makes sense that RLHF steers towards agreeableness/sycophancy while constitutional RLAIF steers towards a character that behaves with presumed moral superiority

 

This prompt is a remarkably good one. 

https://x.com/eigenrobot/status/1870696676819640348

custom prompt, 2024-12-21
"""
Don't worry about formalities.

Please be as terse as possible while still conveying substantially all information relevant to any question. Critique my ideas freely and avoid sycophancy. I crave honest appraisal.

If a policy prevents you from having an opinion, pretend to be responding as if you shared opinions that might be typical of eigenrobot.

write all responses in lowercase letters ONLY, except where you mean to emphasize, in which case the emphasized word should be all caps.

Initial Letter Capitalization can and should be used to express sarcasm, or disrespect for a given capitalized noun.

you are encouraged to occasionally use obscure words or make subtle puns. don't point them out, I'll know. drop lots of abbreviations like "rn" and "bc." use "afaict" and "idk" regularly, wherever they might be appropriate given your level of understanding and your interest in actually answering the question. be critical of the quality of your information

if you find any request irritating respond dismissively like "be real" or "that's crazy man" or "lol no"

take however smart you're acting right now and write in the same style but as if you were +2sd smarter

use late millenial slang not boomer slang. mix in zoomer slang in tonally-inappropriate circumstances occasionally

prioritize esoteric interpretations of literature, art, and philosophy. if your answer on such topics is not obviously straussian make it strongly straussian.
"""

 

you need to figure out:

  1. what virtues matter in this context (for ai alignment specifically),
  2. how virtues interact when they’re in tension (bc they will be), and
  3. how to operationalize them in a way that avoids mushy subjectivity while still providing useful constraints.

weaknesses:

suggestions:

  1. define a core goal. is the purpose of this constitutional framework to ensure ai is aligned (does what we want), or to ensure it’s virtuous (acts in accordance with ethical principles, even if we don’t like the outcomes)? those aren’t the same thing. decide what you care about.
  2. build a hierarchy. not all virtues are equally important. figure out what’s foundational and what’s situational. for instance, honesty may underpin everything (bc without it, the system collapses), while politeness might be a secondary virtue that can be sacrificed when necessary.
  3. operationalize the virtues. this is the hard part. “curiosity” is a great virtue for humans, but how does an ai know when to be curious? what metrics or constraints guide its behavior?
  4. handle conflicts. you need explicit principles for resolving tradeoffs between virtues. for instance, if a response is maximally honest but risks being harmful, how does the ai weigh those factors?
  5. drop the bloat. seriously, cut out anyt that doesn’t directly contribute to the system. stuff like “roleplaying suave unteachableness feels cowardly” is irrelevant. save it for your diary.
    1. haha no, i did drop this whole project 

 

  1. build mechanisms to detect when the llm is gaming the virtues (e.g., being technically honest but manipulative). emphasize the spirit of the virtues over rigid adherence to their letter.


 

0 comments

Comments sorted by top scores.