Hypothetical: what would you do?


post by JNS (jesper-norregaard-sorensen) · 2023-08-03T22:39:55.026Z · LW · GW · No comments

This is a question post.


Let’s pretend I have a semi-rigorous model that lays out why RLHF is doomed to fail, and also why it negatively affects model performance.

Let’s go further into la-la land and pretend that I have an architectural plan that does much better: it is very transparent, steerable and corrigible, and it can be deployed and used without changing or retraining the base LLM.

There are some downsides: it requires more compute at inference time, it is not provably bulletproof, it likely breaks in the SI regime, and it definitely breaks under self-improvement (so it is very definitely NOT an alignment proposal).

In the short term this looks beneficial, it also looks like it shortens timelines, and it is extremely unlikely to advance the AI safety field (in the direction of what we ultimately want and need).

What should I do, if I ever happened to be in such a situation?

Prototype it, with limited access for the express purpose of breaking stuff (black box, absolutely no architectural information provided).
Write it up and publish.
Forget about it, smarter people must have already thought of it, and since it’s not a thing, I am clearly wrong.
Forget about it, only helps capabilities.

Answers

answer by Charlie Steiner · 2023-08-03T22:51:50.401Z · LW(p) · GW(p)

I endorse the "overly galaxy brained strategy." If you actually understand why it's not useful even as a step towards some other alignment scheme that works for superintelligence, you should just drop it and think about other things.

However, usually things aren't so cut and dried. In the course of arriving at the epistemic state hypothesized above, it's probably a good idea to talk to some other safety researchers.

Generally if you think of something that's super useful for present-day systems, it's related to ideas that are useful for future systems. In that case, I endorse attempting to study your idea for its safety properties for a while and then eventually publishing (preferably just in time to scoop people in industry who are thinking about similar things :P ).

↑ comment by JNS (jesper-norregaard-sorensen) · 2023-08-05T19:07:02.089Z · LW(p) · GW(p)

My hypothetical self thanks you for your input and has punted the issues to the real me.

I feel like I need to dig a little bit into this:

"If you actually understand why it's not useful"

Honestly, I don't know for sure that I do. How can I, when everything is so ill-defined and we have so few scraps of solid fact to base things on?

That said, there are a couple of issues, and the major one is grounding, or rather the lack of grounding.

Grounding is IMO a core problem, although people rarely talk about it. I think that is mainly because we (humans) have seemingly solved it.

I don't think that's the case at all, but because our cognition is pretty dang good at heuristics and error correction, it rarely gets noticed, and even high-impact situations are, in the grand scheme of things, not particularly noteworthy.

The hypothetical architecture my hypothetical self has been working on cannot do proper grounding^[1]. The short version is that it does something that looks a little like what humans do, so it is heuristics-based and error-prone.

Now that should scale somewhat (but how much?), but errors persist and at SI capability level the potential consequences look scary.

(the theme here is uncertainty and that actually crops up all over the place)

Anyways, an accord has been reached: the hypothetical architecture will exit mind space and be used to build a PoC.

The primary goal is to see whether it works (likely not), the secondary goal is to get a feel for how much fidelity is lost, and the tertiary goal is to try to gauge how manageable the uncertainty and errors are.

No need to contemplate any further steps until I know if it's workable in the real world.

(Obviously, if it works in some capacity, tangible results will be produced, and those could hopefully be used to engage with others and do more work on assessing its potential usefulness for AI safety in general.)

^[1] Having actually looked at the problem, I don't think it is solvable, at least not in the sense of being provably error-free.
