No77e's Shortform

post by No77e (no77e-noi) · 2023-02-18T12:45:42.224Z · LW · GW · 8 comments

comment by No77e (no77e-noi) · 2023-03-13T08:26:24.319Z · LW(p) · GW(p)

If LLM simulacra resemble humans but are misaligned, that doesn't bode well on the S-risk front.

comment by No77e (no77e-noi) · 2023-03-13T10:27:35.754Z · LW(p) · GW(p)

The Waluigi effect also seems bad for S-risk: "Optimize for pleasure, ..." flips into "Optimize for suffering, ...".

comment by No77e (no77e-noi) · 2023-03-13T08:25:26.711Z · LW(p) · GW(p)

An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.

A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and a near miss on human values is exactly the kind of outcome where S-risk could be large.

comment by No77e (no77e-noi) · 2023-03-09T15:51:33.296Z · LW(p) · GW(p)

This community has developed a bunch of good tools for helping resolve disagreements, such as double cruxing. It's a waste that they haven't been systematically deployed for the MIRI conversations. Those conversations could have been more productive, and we could have walked away with a succinct and precise understanding of where the disagreements are and why.

comment by No77e (no77e-noi) · 2023-03-09T15:57:46.643Z · LW(p) · GW(p)

We should run Paul Christiano's debate game with alignment researchers as the debaters instead of ML systems.

comment by No77e (no77e-noi) · 2023-02-26T18:17:20.919Z · LW(p) · GW(p)

Trying to write a reward function, or a loss function, that captures human values seems hopeless.

But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that's less hopeless.

It's the difference between constructing something and recognizing it, between proving and checking, between producing and criticizing, and so on...

comment by No77e (no77e-noi) · 2023-02-18T12:45:42.426Z · LW(p) · GW(p)

As a failure mode of specification gaming, agents might modify their own goals. 

As a convergent instrumental goal, agents want to prevent their goals from being modified.

I think I know how to resolve this apparent contradiction, but I'd like to see other people's opinions about it.

comment by No77e (no77e-noi) · 2023-02-18T15:41:30.399Z · LW(p) · GW(p)

Why shouldn't this work? What's the epistemic failure mode being pointed at here?