Comments

Comment by myyycroft on Orthogonal's Formal-Goal Alignment theory of change · 2024-11-14T09:40:23.374Z · LW · GW

I endorse alignment proposals that aim to be formally grounded; however, I'd like to hear some concrete ideas about how you plan to handle the common hard subproblems.

At the beginning of the post, you say that you want to 1) build a formal goal which leads to good worlds when pursued and 2) design an AI which pursues this goal.

  • It seems to me that 1) involves some form of value learning (since we are talking about good worlds). Can you give a high-level overview of how you concretely plan to deal with the complexity and fragility of value?
  • Now suppose 1) is solved. Can you give a high-level overview of how you plan to design the AI? In particular, how would you make it aimable?

Comment by myyycroft on jacquesthibs's Shortform · 2024-09-06T10:47:08.263Z · LW · GW

GPT-2 1.5B is small by today's standards. I hypothesize that people are unsure whether findings made at this scale will generalize to frontier models (or at least to the level of LLaMa-3.1-70B), and that's why nobody is working on it.

However, I was impressed by "Pre-Training from Human Preferences". I suspect that pretraining itself could be improved along these lines, and that would be a massive deal for alignment.
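
For concreteness, here is a minimal sketch of the conditional-training idea from that line of work, as I understand it: each pretraining document (or segment) is prefixed with a control token chosen by a preference score, and the model is then trained with the ordinary next-token objective, so that generation can later be conditioned on the "good" token. The token names, the threshold, and the `score_document` stub below are my own illustrative assumptions, not the paper's exact setup.

```python
# Sketch of conditional pretraining: tag each document with a control token
# chosen by a preference score before standard next-token training.
# Token names, threshold, and the scoring stub are illustrative assumptions.

from typing import Iterable, Iterator

GOOD_TOKEN = "<|good|>"   # assumed control-token names
BAD_TOKEN = "<|bad|>"
THRESHOLD = 0.5           # assumed cutoff on the preference score

BLOCKLIST = {"badword"}   # toy stand-in for undesirable content


def score_document(text: str) -> float:
    """Stand-in for a learned reward/preference model returning a score in [0, 1]."""
    return 0.0 if any(word in BLOCKLIST for word in text.lower().split()) else 1.0


def tag_documents(docs: Iterable[str]) -> Iterator[str]:
    """Prefix each document with a control token based on its preference score."""
    for doc in docs:
        token = GOOD_TOKEN if score_document(doc) >= THRESHOLD else BAD_TOKEN
        yield f"{token} {doc}"


if __name__ == "__main__":
    corpus = [
        "The capital of France is Paris.",
        "Here is some badword text we would rather not imitate.",
    ]
    for tagged in tag_documents(corpus):
        print(tagged)
    # At sampling time you would condition generation on GOOD_TOKEN,
    # steering the model toward the preferred part of the distribution.
```

The appeal, to me, is that the preference signal shapes the model throughout pretraining rather than being bolted on afterwards via fine-tuning.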