Matthew Khoriaty's Shortform

post by Matthew Khoriaty (matthew-khoriaty) · 2025-02-21T00:02:43.362Z · LW · GW · 6 comments

comment by Matthew Khoriaty (matthew-khoriaty) · 2025-02-21T00:02:43.359Z · LW(p) · GW(p)

RL techniques (reasoning + ORPO) have had incredible success on reasoning tasks. It should be possible to apply them to any task with a success/failure reward signal, provided the signal isn't too noisy and the model can sometimes succeed.

Is it time to make the automated Alignment Researcher?

Task: write LessWrong posts and comments. Reward signal: get LessWrong upvotes.

More generally, what is stopping people from making RL forum posters on eg Reddit that will improve themselves?
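
Roughly, the training loop I'm imagining looks like the toy sketch below. Everything in it is a stand-in: the "policy" is just a distribution over a few canned post styles and the "karma" signal is a hard-coded probability table, whereas a real version would sample text from an LLM and use actual upvotes as the binary reward.

```python
# Toy sketch of the proposed setup: a policy proposes posts, a binary reward
# (did the post get net-positive karma?) drives a REINFORCE update.
# The post styles and upvote probabilities below are made up for illustration.

import math
import random

POST_STYLES = ["technical deep-dive", "question", "hot take", "link dump"]

# Hypothetical ground truth: probability that each style gets net-upvoted.
UPVOTE_PROB = {"technical deep-dive": 0.6, "question": 0.5,
               "hot take": 0.2, "link dump": 0.1}

logits = [0.0] * len(POST_STYLES)   # policy parameters
LEARNING_RATE = 0.1

def sample_style(logits):
    """Sample an action (post style) from the softmax policy."""
    z = max(logits)
    probs = [math.exp(l - z) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs

baseline = 0.0  # running average reward, reduces gradient variance
for step in range(2000):
    action, probs = sample_style(logits)
    # Binary reward: 1 if the (simulated) post is net-upvoted, else 0.
    reward = 1.0 if random.random() < UPVOTE_PROB[POST_STYLES[action]] else 0.0
    baseline = 0.99 * baseline + 0.01 * reward
    advantage = reward - baseline
    # REINFORCE gradient for a softmax policy: one_hot(action) - probs.
    for i in range(len(logits)):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += LEARNING_RATE * advantage * grad

best = max(range(len(logits)), key=lambda i: logits[i])
print("Policy prefers:", POST_STYLES[best])
```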

Replies from: Viliam, cubefox
comment by Viliam · 2025-02-21T10:35:49.848Z · LW(p) · GW(p)

More generally, what is stopping people from making RL forum posters on eg Reddit that will improve themselves?

Could be a problem with not enough learning data -- you get banned for making bad comments before you get enough feedback to learn how to write good ones? Also, people don't necessarily upvote based on your comment alone; they may also take your history into account (if you were annoying in the past, they may get angry even about a mediocre comment, while if you were nice in the past, they may be more forgiving). Also, comments happen in a larger context -- a comment that is downvoted in one forum could be upvoted in a different forum, or in the same forum but a different thread, or maybe even in the same thread but on a different day (for example, if your comment just says what someone else already said before).

Maybe someone is already experimenting with this on Facebook, but the winning strategy seems to be reposting cute animal videos, or posting an AI-generated picture of a nice landscape with the comment "wow, I didn't know that nature is so beautiful in <insert random country>". (At least these seem to be over-represented in my feed.)

Task: write LessWrong posts and comments. Reward signal: get LessWrong upvotes.

Sounds like a good way to get banned. But as a thought experiment, you might start at some place where people judge content less strictly, and gradually move towards more difficult environments? Like, before LW, you should probably master the "change my view" subreddit. Before that, probably Hacker News. I am not sure about the exact progression. One problem is that the easier environments might teach the model actively bad habits that would later prevent it from succeeding in the stricter environments.

But, to state the obvious, this is probably not a desirable thing, because the model could get high LW karma by simply exploiting our biases, or just by posting a lot (once it succeeds at making net-positive comments on average).

Replies from: matthew-khoriaty
comment by Matthew Khoriaty (matthew-khoriaty) · 2025-03-06T22:56:30.427Z · LW(p) · GW(p)

The Facebook bots aren't doing R1- or o1-style reasoning about the context before making an optimal reinforcement-learned post. It's probably just bandits, or humans making a trash-producing algorithm that works and letting it loose.
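
For concreteness, by "bandits" I mean something like the Thompson-sampling sketch below: no reasoning about the thread, just per-arm success counts over a few post templates. The arm names and engagement rates are made up.

```python
# Hedged sketch of the "it's just bandits" hypothesis: a Thompson-sampling
# bandit picks which kind of post to make and updates on a binary engagement
# signal. All names and numbers here are illustrative.

import random

ARMS = ["cute animal video", "AI landscape picture", "recycled meme"]

# Beta(alpha, beta) posterior over each arm's engagement rate.
alpha = {arm: 1.0 for arm in ARMS}
beta = {arm: 1.0 for arm in ARMS}

# Hypothetical true engagement rates, unknown to the bandit.
TRUE_RATE = {"cute animal video": 0.30,
             "AI landscape picture": 0.25,
             "recycled meme": 0.10}

for post in range(5000):
    # Thompson sampling: draw a plausible rate for each arm, play the argmax.
    draws = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in ARMS}
    choice = max(draws, key=draws.get)
    engaged = random.random() < TRUE_RATE[choice]   # simulated feedback
    if engaged:
        alpha[choice] += 1
    else:
        beta[choice] += 1

for arm in ARMS:
    est = alpha[arm] / (alpha[arm] + beta[arm])
    print(f"{arm}: estimated engagement {est:.2f}")
```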

Agreed that I should try Reddit first. And I think there should be ways to guide an LLM towards the reward signal of "write good posts" before starting the RL, though I didn't find any established techniques when I researched reward-model-free reinforcement learning loss functions that act on the number of votes a response receives. (Concretely, I searched DPO's citations for "vote": lots of results, though none of them have many citations.)
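
One possible shape for such a loss (a sketch, not an established technique): build preference pairs from vote counts and plug them into the standard DPO objective. The log-probabilities below are placeholders; in practice they would come from scoring each comment under the finetuned policy and a frozen reference model.

```python
# Sketch: turn pairs of comments on the same post into (preferred, dispreferred)
# pairs by vote count, then apply the standard DPO loss to each pair.
# The comments, votes, and log-probs are invented for illustration.

import math

BETA = 0.1  # DPO temperature

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=BETA):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical data: comments on one post, their karma, and their log-probs
# under the policy being trained and a frozen reference model.
comments = [
    {"text": "careful object-level critique", "votes": 25,
     "logp_policy": -42.0, "logp_ref": -45.0},
    {"text": "generic agreement", "votes": 3,
     "logp_policy": -18.0, "logp_ref": -17.5},
    {"text": "off-topic tangent", "votes": -4,
     "logp_policy": -30.0, "logp_ref": -29.0},
]

# Preference pairs: any comment beats any lower-voted comment on the same post.
pairs = [(a, b) for a in comments for b in comments if a["votes"] > b["votes"]]

total = sum(dpo_loss(a["logp_policy"], b["logp_policy"],
                     a["logp_ref"], b["logp_ref"]) for a, b in pairs)
print(f"mean DPO loss over {len(pairs)} pairs: {total / len(pairs):.3f}")
```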

comment by cubefox · 2025-02-21T14:57:36.555Z · LW(p) · GW(p)

Reinforcement Learning is very sample-inefficient compared to supervised learning, so it mostly just works if you have some automatic way of generating both training tasks and reward, which scales to millions of samples.

Replies from: matthew-khoriaty
comment by Matthew Khoriaty (matthew-khoriaty) · 2025-03-01T03:33:55.621Z · LW(p) · GW(p)

DeepSeek R1 used 8,000 samples; s1 used 1,000 offline samples. That really isn't all that much.

Replies from: cubefox
comment by cubefox · 2025-03-01T07:20:16.009Z · LW(p) · GW(p)

S1 is apparently using supervised learning:

We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces (...). After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K (...).

But 8,000 samples for R1 is a lot less than I thought.
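
For reference, the supervised finetuning step the s1 paper describes is roughly the shape sketched below. The model name follows the paper; the dataset file, field names, and hyperparameters are placeholders, and a 32B model would realistically need a multi-GPU or parameter-efficient setup rather than this bare-bones script.

```python
# Rough sketch of s1-style supervised finetuning: ordinary next-token
# prediction on ~1,000 (question, reasoning trace, answer) examples.
# Dataset path, field names, and hyperparameters are assumptions.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "Qwen/Qwen2.5-32B-Instruct"   # model named in the s1 quote above
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Placeholder dataset, assumed to have "question", "reasoning", "answer" fields.
dataset = load_dataset("json", data_files="s1k_style_data.json")["train"]

def to_tokens(example):
    # Concatenate question, reasoning trace, and answer into one training text.
    text = (f"Question: {example['question']}\n"
            f"Reasoning: {example['reasoning']}\n"
            f"Answer: {example['answer']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=4096)

tokenized = dataset.map(to_tokens, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="s1-style-sft",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=3,
                           learning_rate=1e-5,
                           bf16=True),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM labels (shifted input_ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```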