Comments

Comment by myyycroft on Orthogonal's Formal-Goal Alignment theory of change · 2024-11-14T09:40:23.374Z · LW · GW

I endorse alignment proposals that aim to be formally grounded; however, I'd like to hear some concrete ideas about how you plan to handle the common hard subproblems.

At the beginning of the post, you say that you want to 1) build a formal goal which leads to good worlds when pursued and 2) design an AI which pursues this goal.

  • It seems to me that 1) involves some form of value learning (since we are talking about good worlds). Can you give a high-level overview of how you concretely plan to deal with the complexity and fragility of value?
  • Now suppose 1) is solved. Can you give a high-level overview of how you plan to design the AI? In particular, how would you make it aimable?

Comment by myyycroft on jacquesthibs's Shortform · 2024-09-06T10:47:08.263Z · LW · GW

GPT-2 1.5B is small by today's standards. I hypothesize that people are unsure whether findings made at this scale will generalize to frontier models (or at least to the level of LLaMa-3.1-70B), and that's why nobody is working on it.

However, I was impressed by "Pre-Training from Human Preferences". I suspect that pretraining itself could be improved along these lines, and that would be a massive deal for alignment.
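
For concreteness, here is a minimal sketch of the conditional-training idea from that line of work, as I understand it: each pretraining document (or segment) is prefixed with a control token chosen by a preference score, and the model is then trained with the ordinary next-token objective, so that generation can later be conditioned on the "good" token. The token names, the threshold, and the `score_document` stub below are my own illustrative assumptions, not the paper's exact setup.

```python
# Sketch of conditional pretraining: tag each document with a control token
# chosen by a preference score before standard next-token training.
# Token names, threshold, and the scoring stub are illustrative assumptions.

from typing import Iterable, Iterator

GOOD_TOKEN = "<|good|>"   # assumed control-token names
BAD_TOKEN = "<|bad|>"
THRESHOLD = 0.5           # assumed cutoff on the preference score

BLOCKLIST = {"badword"}   # toy stand-in for undesirable content


def score_document(text: str) -> float:
    """Stand-in for a learned reward/preference model returning a score in [0, 1]."""
    return 0.0 if any(word in BLOCKLIST for word in text.lower().split()) else 1.0


def tag_documents(docs: Iterable[str]) -> Iterator[str]:
    """Prefix each document with a control token based on its preference score."""
    for doc in docs:
        token = GOOD_TOKEN if score_document(doc) >= THRESHOLD else BAD_TOKEN
        yield f"{token} {doc}"


if __name__ == "__main__":
    corpus = [
        "The capital of France is Paris.",
        "Here is some badword text we would rather not imitate.",
    ]
    for tagged in tag_documents(corpus):
        print(tagged)
    # At sampling time you would condition generation on GOOD_TOKEN,
    # steering the model toward the preferred part of the distribution.
```

The appeal, to me, is that the preference signal shapes the model throughout pretraining rather than being bolted on afterwards via fine-tuning.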