Posts

Jakub Halmeš's Shortform 2025-01-11T17:47:44.771Z
The Inner Alignment Problem 2024-02-24T17:55:55.649Z

Comments

Comment by Jakub Halmeš (jakub-halmes-1) on Jakub Halmeš's Shortform · 2025-01-29T16:52:31.107Z · LW · GW

I wonder if you could take the R1-Zero training regime, penalize/restrict using existing words from all languages (maybe only in the scratchpad, not the final response), and obtain a model which can solve math problems by reasoning in a non-existent language.

Comment by Jakub Halmeš (jakub-halmes-1) on Jesse Hoogland's Shortform · 2025-01-23T16:00:11.552Z · LW · GW

During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable.

I also found this trade-off between human readability and performance noteworthy.

Comment by Jakub Halmeš (jakub-halmes-1) on Jakub Halmeš's Shortform · 2025-01-13T16:46:35.257Z · LW · GW

Yes, fair here means that their subjective EVs are equal. The post referenced in the sibling comment calls it "Even Odds", which is probably better.

Comment by Jakub Halmeš (jakub-halmes-1) on Jakub Halmeš's Shortform · 2025-01-13T16:43:44.876Z · LW · GW

I did not realize that. Thank you for the reference! 

Comment by Jakub Halmeš (jakub-halmes-1) on Jakub Halmeš's Shortform · 2025-01-11T17:47:44.982Z · LW · GW

If Alice thinks X happens with a probability of 20% while Bob thinks it's 40%, what would be a fair bet between them? 

I created a Claude Artifact, which calculates a bet such that the expected value is the same for both players.

In this case, Bob wins if X happens (he thinks it's more likely). If Alice bets $100, Bob should bet $42.86, and the EV of such a bet for both players (each according to their own beliefs) is $14.29.
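The numbers above can be derived by setting the two subjective EVs equal and solving for Bob's stake. A minimal sketch of that calculation (my own reconstruction, not the Artifact's code; the function names are hypothetical):

```python
def fair_stake(alice_stake, p_alice, p_bob):
    """Bob (who assigns the higher probability to X) bets that X happens.

    Returns Bob's stake such that both players' subjective EVs are equal:
      Alice's EV: (1 - p_alice) * bob_stake - p_alice * alice_stake
      Bob's EV:   p_bob * alice_stake - (1 - p_bob) * bob_stake
    Solving for bob_stake gives the expression below.
    """
    return alice_stake * (p_alice + p_bob) / (2 - p_alice - p_bob)

def subjective_ev(alice_stake, bob_stake, p_alice):
    # Alice's EV under her own beliefs (equal to Bob's under his, by construction).
    return (1 - p_alice) * bob_stake - p_alice * alice_stake

bob_stake = fair_stake(100, 0.20, 0.40)
print(round(bob_stake, 2))                            # 42.86
print(round(subjective_ev(100, bob_stake, 0.20), 2))  # 14.29
```

With Alice at 20% and Bob at 40%, this reproduces the stakes and EV quoted above.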

EDIT: I updated the calculator to correctly handle the case where Alice's probability is higher than Bob's.

Comment by Jakub Halmeš (jakub-halmes-1) on The Inner Alignment Problem · 2024-02-24T10:20:52.877Z · LW · GW

I wrote this mostly for personal purposes: I wanted to organize my thoughts about the problem while reading the paper, and publishing the notes, even if no one reads them, forces me to write more clearly and precisely.

I would like feedback on whether posts like this one are valuable to other people. Please let me know! Thank you.