My Central Alignment Priority (2 July 2023)

post by Nicholas / Heather Kross (NicholasKross) · 2023-07-03T01:46:01.764Z

Epistemic status: The further you scroll, the more important the points are.

I'm continuing the deconfusion from my alignment timeline post. Goal: figure out which sub-problem of AI alignment to work on first.

Quick Notation

1 = steering
2 = goal

1.1k = no steering found
1.2k = poor steering found
1.3g = steering lets us do a pivotal act
1.4g = steering + existing goal inputs
1.5g = steering not needed, alignment by default

2.1k = no goal found, even with steering
2.2k = poor goal found, combined with steering
2.2g = goal is good/enough
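To keep these labels straight, here is a minimal sketch of the notation as a lookup table. This is my own illustrative Python encoding, not something from the post; the "k" suffix appears to mark bad outcomes and "g" good ones.

```python
# Illustrative encoding of the notation above -- my own sketch, not from the post.
# "k"-labeled scenarios appear to be bad outcomes, "g"-labeled ones good outcomes.

PROBLEMS = {
    "1": "steering",
    "2": "goal",
}

SCENARIOS = {
    "1.1k": "no steering found",
    "1.2k": "poor steering found",
    "1.3g": "steering lets us do a pivotal act",
    "1.4g": "steering + existing goal inputs",
    "1.5g": "steering not needed, alignment by default",
    "2.1k": "no goal found, even with steering",
    "2.2k": "poor goal found, combined with steering",
    "2.2g": "goal is good/enough",
}

def looks_good(label: str) -> bool:
    """True for 'g' (good) scenario labels, False for 'k' (bad) ones."""
    return label.endswith("g")
```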

The Logic

Let's say that my effort single-handedly changes the relative balance of research between 1 and 2, so we can ignore scenarios where my work doesn't do anything. (By the same intuition, we ignore 1.5g, since that scenario doesn't require any effort.)

If I research 1 (first) and scenario X happens, what is the outcome?

If I research 2 (first) and scenario X happens, what is the outcome?
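The two questions above amount to filling in a small matrix: one axis is which problem I research first, the other is which scenario ends up happening. Here is a skeleton of that comparison, purely my own sketch; the post fills the cells in with prose reasoning rather than code, so the entries are left as placeholders.

```python
# Hypothetical skeleton of the comparison -- the cell contents are placeholders,
# since the post reasons about them in prose rather than spelling them out here.

RESEARCH_CHOICES = ["research 1 (steering) first", "research 2 (goal) first"]

# Scenarios under consideration; 1.5g is excluded because it requires no effort,
# per the reasoning above.
CONSIDERED_SCENARIOS = ["1.1k", "1.2k", "1.3g", "1.4g", "2.1k", "2.2k", "2.2g"]

# outcome[choice][scenario] answers "what happens?" for that combination.
outcome = {
    choice: {scenario: "?" for scenario in CONSIDERED_SCENARIOS}
    for choice in RESEARCH_CHOICES
}
```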

The Bottom Line

I, personally, right now, should have the key research focus of "solve problem 2".

If I get interesting ideas about problem 1, I should write them down privately and... well, I'm not sure, but probably not publish them quickly and openly.

This will be the case until/unless something happens that would change my logic above.

Some things that have not happened, but which I would expect to change my mind on the above points:

More footnotes about my Alignment Timeline specifically.

1 comment


comment by Nicholas / Heather Kross (NicholasKross) · 2023-08-29T17:36:37.156Z

NOTE: I used "goal", "goals", and "values" interchangeably in some writings, such as this one, and that was a mistake. A more consistent frame would be "steering vs. target-selection" (especially as per the Rocket Alignment analogy).