Posts

Is there a manifold plot of all people who had a say in AI alignment? 2023-03-31T21:50:09.570Z

Comments

Comment by skulk-and-quarrel on On AutoGPT · 2023-04-14T03:49:38.551Z · LW · GW

I think we should ban research on embodied AI and let GPT take over the Metaverse. Win-win.

Edit: and also nuke underground robot research facilities? Doesn’t seem like a great idea in hindsight

Comment by skulk-and-quarrel on A rough and incomplete review of some of John Wentworth's research · 2023-03-28T22:52:00.409Z · LW · GW

Disclaimer: I have not read Wentworth's post or the linked one, but I know a little about finite-sample and asymptotic bounds.

(He had a few other versions, allegedly with fuller proofs, though I was not able to understand them and focused on this one.)

I think the key point of the statement is "any finite-entropy function". This makes sure that the "infinity" in the sampling goes away. That being said, it should be possible to extend the proof to non-independent samples; Cosma Shalizi has done a ton of work on this.
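To make that condition concrete, here is one plausible reading of "finite-entropy function", with f and X as my own placeholder symbols (the original notation did not survive formatting): the statistic f(X) must concentrate its probability mass enough that its Shannon entropy is finite.

```latex
% Illustrative only: $f$ and $X$ are my placeholders, not necessarily the post's notation.
\[
  H\bigl(f(X)\bigr) \;=\; -\sum_{y} P\bigl(f(X) = y\bigr)\,\log P\bigl(f(X) = y\bigr) \;<\; \infty
\]
```

Any statistic with a finite range satisfies this trivially, which is one way the condition keeps the "infinity" from the sampling under control.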

Comment by skulk-and-quarrel on RLHF does not appear to differentially cause mode-collapse · 2023-03-20T22:48:30.770Z · LW · GW

The impact in ChatGPT could potentially be due to longer prompts or the "system prompt". It would be great to test that in a similar analysis.
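A rough sketch of how that comparison could be run; `sample_completion` is a hypothetical stub for whatever chat API is being analyzed, and entropy over repeated samples is just one crude proxy for mode collapse:

```python
import math
from collections import Counter

def sample_completion(prompt: str, system_prompt: str | None = None) -> str:
    """Hypothetical stub: call whatever chat API you are analyzing and return
    one sampled completion. Not a real library function."""
    raise NotImplementedError

def empirical_entropy(samples: list[str]) -> float:
    """Shannon entropy (bits) of the distribution over distinct completions.
    Lower entropy across repeated samples is one rough sign of mode collapse."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def compare_system_prompt_effect(prompt: str, system_prompt: str, n: int = 100):
    """Sample the same user prompt with and without the system prompt and
    compare how concentrated the outputs are in each condition."""
    without_sys = [sample_completion(prompt) for _ in range(n)]
    with_sys = [sample_completion(prompt, system_prompt) for _ in range(n)]
    return empirical_entropy(without_sys), empirical_entropy(with_sys)
```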

Comment by skulk-and-quarrel on Super-Luigi = Luigi + (Luigi - Waluigi) · 2023-03-17T18:48:45.439Z · LW · GW

What about the luigis and waluigis in different languages, cultures, and religions? Ones that can be described via code? It feels like you can always invent new waluigis unless RLHF killed all of the waluigis from your pre-training data (whatever that means).

The token limit (let's call it N) is your limit here: you just need to create a waluigi within the first N - k steps so that you can utilize him for the last k steps. I think this eventually boils down to something about computational bounds, like whether you can create a waluigi in that much time.

Comment by skulk-and-quarrel on Super-Luigi = Luigi + (Luigi - Waluigi) · 2023-03-17T18:31:58.473Z · LW · GW

Second insight:
If you can find Luigi and Waluigi in the behavior vector space, then you have a helpful direction to nudge the AI towards. You nudge it in the direction of Luigi - Waluigi.

You need to do this for all (x, y) pairs of Luigis and Waluigis. How do you enumerate all the good things in the world with their evil twins, and then somehow compare the internal embedding shift against all of those directions? Is that even feasible? You would probably just get stuck.
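For what it's worth, the "compare the shift against all of these directions" step itself is cheap once the vectors exist; the enumeration is the hard part. A toy sketch under the assumption that each Luigi/Waluigi really does correspond to an embedding vector (all names and shapes below are illustrative, nothing here comes from the post):

```python
import numpy as np

def alignment_scores(shift: np.ndarray,
                     luigi_vecs: np.ndarray,
                     waluigi_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity of an internal embedding shift with each
    Luigi - Waluigi direction. luigi_vecs and waluigi_vecs have shape
    (num_pairs, dim); positive scores mean the shift points toward the
    Luigi of that pair."""
    directions = luigi_vecs - waluigi_vecs                      # (num_pairs, dim)
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    unit_shift = shift / np.linalg.norm(shift)
    return directions @ unit_shift                              # (num_pairs,)

# The combinatorial worry in the comment shows up as num_pairs: even with
# cheap cosine checks, enumerating "all good things and their evil twins"
# is where this scheme gets stuck.
```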