How familiar is the LessWrong community as a whole with the concept of reward-modelling?

post by Oxidize · 2025-04-09T23:33:18.044Z · LW · GW · 8 comments

This is a question post.

I initially assumed that the concept of reward-modelling would be something most LessWrongers were very familiar with. After all, this is one of the best communities for discussing the topic, and a large percentage of all posts here are AI- or doom-related.

However, my skeptical side quickly kicked in and I started to doubt my initial assumption, realizing that it could be confirmation bias: I am personally highly invested in reward-modelling, and I tend to ignore information that has little to no relation to it. Additionally, I do not have access to any actual data on this, nor have I considered perspectives outside of my own.

Many of my beliefs about agency and reward-modelling are well captured by the YouTube channel RobertMilesAI. How familiar is the community with the concepts expressed on that channel?

How aware is the community as a whole of the concept?

How interested is the community as a whole in the concept?

I would be very thankful for any replies. I'm very invested in the concept of reward-modelling, so any outside perspectives on the topic are very valuable to me.

Answers

8 comments

Comments sorted by top scores.

comment by Viliam · 2025-04-10T13:51:42.778Z · LW(p) · GW(p)

The words don't ring a bell. You don't provide any explanation or reference, so I am unable to tell whether I am unfamiliar with the concept, or just know it under a different name (or no name at all).

Replies from: Oxidize
comment by Oxidize · 2025-04-10T13:59:53.939Z · LW(p) · GW(p)

Thank you so much for the reply. You prevented me from making a pretty big mistake.

I'm defining reward-modelling as the manipulation of the direction of an agent's intelligence, from a goal-directed perspective.

So the reward-modelling of an AI might be the weights used, its training environment, mesa-optimization structure, inner-alignment structure, etc.

Or for a human, it might be genetics, pleasure, and pain.

Is there a better word I can use for this concept? Or maybe I should just make up a word?

Replies from: Viliam, mishka
comment by Viliam · 2025-04-10T14:28:05.215Z · LW(p) · GW(p)

I approximately see the context of your question, but I am not sure what exactly you are talking about. Maybe try to be less abstract and more ELI5, with specific examples of what you mean (and of the adjacent concepts that you don't mean)?

Is it about which forces direct an agent's attention in the short term? Like, a human would do X because we have an instinct to do X, or because of a previous experience that doing X leads to pleasure, either immediately or in the longer term. And avoid Y because of an innate aversion, or a previous experience that Y causes pain.

Seems to me that "genetics" is a different level of abstraction than "pleasure and pain". If I try to disentangle this, it seems to me that humans

  • immediately act on a stimulus (including internal, such as "I just remembered that...")
  • that is either a hardwired instinct, or learned, i.e. a reaction stored in memory
  • the memory is updated by things causing pleasant or painful experience (again, including internal experience, e.g. hearing something makes me feel bad, even if the stimulus itself is not painful)
  • both the instincts and the organization of memory are determined by the genes
  • which are formed by evolution.

Do you want a similar analysis for LLMs? Do you want to attempt to make a general analysis even for hypothetical AIs based on different principles?

Is the goal to know all the levels of "where we can intervene"? Something like: "we can train the AI, we can upvote or downvote its answers, we can directly edit its memory..."?

(I am not an expert on LLMs, so I can't tell you more than the previous paragraph contains. I am just trying to figure out what the thing you are interested in is. It seems to me that people already study the individual parts of that, but... are you looking for some kind of more general approach?)

Replies from: Oxidize, Oxidize
comment by Oxidize · 2025-04-10T22:46:20.977Z · LW(p) · GW(p)

These are 6 sample titles I'm considering using. Do any thoughts come to mind?

  1. AI-like reward functioning in humans. (Comprehensive model)
  2. Agency in humans
  3. Agency in humans | comprehensive model of why humans do what they do
  4. EA should focus less on AI alignment, more on human alignment
  5. EA's AI focus will be the end of us all.
  6. EA's AI alignment focus will be the end of us all. We should focus on human alignment instead
comment by Oxidize · 2025-04-10T15:07:17.897Z · LW(p) · GW(p)

I'd say that the 80/20 of the concept is how reward & punishment affect human behavior.

Is it about which forces?
I would say I'm referring to a combination of instinct, innate attraction/aversion, previous experience, decision-making, attention, and how they relate to each other in an everyday practical context. 

Seems to me that "genetics"
I would say your disentanglement is right on the money. Rather than making an analysis for LLMs, I'm particularly interested in fleshing out the interrelations between concepts as they relate to the human brain.

Do you want a similar analysis for LLMs?
I mean it from a high-level agency perspective, as opposed to in specific AI or machine learning contexts. 

Goal?
My goal is to learn more about what information LessWrongers use and are interested in, so that I can create a better post for the community.


Adjacent concepts

  • Self-discipline
  • Positive psychology
  • Systems & patterns thinking
  • Maybe reward-functions?
Replies from: faul_sname
comment by faul_sname · 2025-04-11T00:28:15.392Z · LW(p) · GW(p)

Can you give one extremely concrete example of a scenario which involves reward modeling, and point to the part of the scenario that you call "reward modeling"?

comment by mishka · 2025-04-10T14:25:44.737Z · LW(p) · GW(p)

It should be a different word, to avoid confusion with reward models (standard terminology for models used to predict the reward in some ML contexts).
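
For concreteness, here is a minimal, purely illustrative sketch of what "reward model" typically denotes in that ML sense: a learned function trained to predict a scalar reward, for example from pairwise human preferences as in RLHF-style setups. The PyTorch architecture, the Bradley-Terry preference loss, and the toy data below are assumptions for illustration, not anything defined in this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps an encoded input (e.g. a prompt-response pair) to a scalar reward estimate."""
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze the last dimension so each input gets one scalar reward.
        return self.net(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the preferred example should score higher than the rejected one."""
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# Toy usage: random vectors stand in for encoded "chosen" and "rejected" examples.
model = RewardModel(input_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred = torch.randn(8, 16)
rejected = torch.randn(8, 16)
loss = preference_loss(model, preferred, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The point is just that "reward model" already names this kind of learned reward predictor, which is why a different term for the human-level concept would reduce confusion.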

Replies from: Oxidize
comment by Oxidize · 2025-04-10T15:08:58.663Z · LW(p) · GW(p)

Thanks for this. Do you have any ideas of what terminology I should use if I mean models used to predict reward in human contexts?