Posts

LLMs seem (relatively) safe 2024-04-25T22:13:06.221Z
AI Safety Concepts Writeup: WebGPT 2023-08-11T01:35:31.196Z
Consider Multiclassing 2022-07-07T14:54:16.797Z
Alignment Risk Doesn't Require Superintelligence 2022-06-15T03:12:56.573Z
Editing Advice for LessWrong Users 2022-04-11T16:32:17.530Z

Comments

Comment by JustisMills on LLMs seem (relatively) safe · 2024-04-26T03:29:24.362Z · LW · GW

I think self-critique runs into the issues I describe in the post, though without insider information I'm not certain. Naively, it seems like self-critique would make existing distortions larger.

For human rating/RL, it seems true that it's possible to be sample efficient (with human brain behavior as an existence proof), but as far as I know we don't actually know how to make it sample efficient in that way, and human feedback in the moment is even more finite than human text that's just out there. So I still see that taking longer than, say, self-play.

I agree that if outcome-based RL swamps initial training run datasets, then the "playing human roles" section is weaker, but is that the case now? My understanding (could easily be wrong) is that RLHF is a smaller postprocessing layer that only changes models moderately, and is nowhere near the bulk of their training.

Comment by JustisMills on What do you do to deliberately practice? · 2022-06-05T02:50:18.600Z · LW · GW

I journal! It's a good way to write at least something daily, and often also feels like a good avenue for healthy introspection.

Comment by JustisMills on Increasing Demandingness in EA · 2022-04-30T17:58:57.319Z · LW · GW

I wrote a reply to this from a more-peripheral-EA perspective on the EA forum here:

https://forum.effectivealtruism.org/posts/YeudcYiArwWrg77Ng/notes-from-a-pledger

Comment by JustisMills on Austin Chen's Shortform · 2022-04-21T03:15:33.805Z · LW · GW

Thank you!

Comment by JustisMills on Editing Advice for LessWrong Users · 2022-04-12T01:48:33.109Z · LW · GW

My pleasure!

Comment by JustisMills on Editing Advice for LessWrong Users · 2022-04-12T01:44:27.977Z · LW · GW

Yeah, that critique is part of why "use more links" is among my least confident advice of the stuff in this post. I like links mostly as an alternative to nothing - if there's a background term that your readers should ideally already know, a link is an economical way to give a leg up to readers who have less background knowledge than your target audience. But for really central terms, yeah, better to summarize in your own words.

Comment by JustisMills on Editing Advice for LessWrong Users · 2022-04-11T18:22:59.469Z · LW · GW

Yeah, that's a good pithy summary! I often suggest replacing "this" with "this [x]".