LessWrong 2.0 Reader
I think the correct solution to models powerful enough to materially help with, say, bioweapon design, is to not train them, or failing that to destroy them as soon as you find they can do that, not to release them publicly with some mitigations and hope nobody works out a clever jailbreak.
keltan on What are some triggers that prompt you to do a Fermi estimate, or to pull up a spreadsheet and make a simple/rough quantitative model?
While an odd answer, it is true for me that music helps to instill rational thinking. I think I've done maybe 3 Fermi estimates in my day-to-day after making and listening to this song.
The Fermi Estimate Jig - LessWrong Inspired https://youtu.be/M_DN3Hl8YzU
Having it stuck in my head has been effective for me. I hope it works for others.
neel-nanda-1 on Refusal in LLMs is mediated by a single direction
Agreed, it seems less elegant. But one guy on Hugging Face did a rough plot of the cross-correlation, and it seems to show that the direction changes with layer: https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.
Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But, really, you want to balance between refusal and not damaging the model. Probably many layers are just kinda irrelevant for refusal. Though really this argues that we're both wrong, and the most surgical intervention is deleting the direction from key layers only.
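The "surgical" intervention discussed here, deleting a refusal direction from the residual stream only at selected layers, amounts to a simple orthogonal projection. A minimal sketch (the function names, shapes, and the per-layer dictionary layout are illustrative assumptions, not the paper's code):

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation along `direction`.

    acts:      array of shape (..., d_model), e.g. residual-stream activations
    direction: array of shape (d_model,), a candidate refusal direction
    """
    d = direction / np.linalg.norm(direction)
    # Subtract each activation's projection onto d, leaving it orthogonal to d.
    return acts - (acts @ d)[..., None] * d

def ablate_at_key_layers(resid_by_layer, direction_by_layer, key_layers):
    """Apply the ablation only at the layers believed to matter for refusal;
    all other layers pass through untouched."""
    return {
        layer: ablate_direction(acts, direction_by_layer[layer])
        if layer in key_layers else acts
        for layer, acts in resid_by_layer.items()
    }
```

Ablating only at key layers is the balance point the comment describes: the model is perturbed where refusal lives, and left alone where the direction is irrelevant.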
cousin_it on Rejecting Television
Reading a book is certainly less stimulating than hunting. And watching a movie also might be less stimulating than hunting. So maybe stimulation by itself isn't the problem, and instead of "superstimuli" we should be worried about activities that are low-effort and/or fruitless. From that perspective, reading a book can be sometimes difficult and fruitful (depending on the book - reading Dostoevsky or Fitzgerald isn't the same as reading a generic romance or young adult novel). And creativity is both difficult and fruitful. So we shouldn't put these things on par with watching TikTok.
ann-brown on Thoughts on seed oil
Yeah, it'd be helpful to know what heavy lifting is going on there, because I feel like there's a pretty strong distinction between 'frozen burger patties that are otherwise indistinguishable from unfrozen burger patties' and 'TV dinner'.
steve2152 on Biorisk is an Unhelpful Analogy for AI Risk
I wish you had titled / framed this as "here are some disanalogies between biorisk and AI risk", rather than suggesting in the title and intro that we should add up the analogies and subtract the disanalogies to get a General Factor of Analogizability between biorisk and AI risk.
We can say that they’re similar in some respects and different in other respects, and if an analogy is leaning on an aspect in which they’re similar, that’s good, and if an analogy is leaning on an aspect in which they’re different, that’s bad. For details and examples of what I mean, see my comments on a different post: here & here.
cousin_it on Accidental Electronic Instrument
Maybe you could reduce the damping, so that when muting you can feel your finger stopping the vibration? It seems to me that more feedback of this kind is usually a good thing for the player. Also the vibration could give you a continuous "envelope" signal to be used later.
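A continuous envelope signal of the kind suggested here is usually extracted with a rectify-and-smooth follower applied to the raw sensor signal. A minimal sketch, assuming a known sample rate; the function name and cutoff parameter are illustrative:

```python
import numpy as np

def envelope_follower(signal, sample_rate, cutoff_hz=20.0):
    """Track the amplitude envelope: full-wave rectify, then one-pole low-pass."""
    # One-pole smoothing coefficient for the given cutoff frequency.
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / sample_rate)
    env = np.empty(len(signal))
    level = 0.0
    for i, s in enumerate(np.abs(signal)):
        level += alpha * (s - level)  # move level toward |s| each sample
        env[i] = level
    return env
```

The smoothed output rises with the vibration's amplitude and decays when it stops, which is exactly the continuous control signal a synth envelope (or a mute detector) would want.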
olli-jaerviniemi on On precise out-of-context steering
Thanks for the idea! I did my own fine-tuning job with the same idea. Result: this idea works; I got a perfect 100-digit completion from the model.
I edited the post to include my experiment here. (I had 1000 examples, batch size 1, LR multiplier 2.)
I now consider this version of the problem solved: one can make GPT-3.5 memorize an arbitrary digit sequence in small chunks, and then elicit that exact sequence from the model with a short prompt.
Thanks again for the contribution!
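The chunked-memorization setup described above is mostly data preparation: split the target digits into short chunks and emit chat-format fine-tuning examples that each teach one chunk. A rough sketch (the prompt wording, chunk size, and example count are illustrative assumptions, not the exact scheme from the post):

```python
import random

def make_finetune_examples(digits, n_examples=1000, chunk_len=10, seed=0):
    """Build chat-format fine-tuning examples, one short chunk per example."""
    rng = random.Random(seed)
    # Split the full digit sequence into fixed-length chunks.
    chunks = [digits[i:i + chunk_len] for i in range(0, len(digits), chunk_len)]
    examples = []
    for _ in range(n_examples):
        i = rng.randrange(len(chunks))
        examples.append({
            "messages": [
                {"role": "user", "content": f"Recite chunk {i} of the sequence."},
                {"role": "assistant", "content": chunks[i]},
            ]
        })
    return examples
```

After fine-tuning on examples like these, eliciting the full sequence is just asking for each chunk in order and concatenating the answers.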
jkaufman on Accidental Electronic Instrument
I was thinking that finger muting wouldn't be possible, because the sensors are physically damped and there's no vibration left for your fingers to stop. Except now that you mention it, it might still be possible! It could be that gently placing your finger on one of them produces a sufficiently recognizable signal that, if it's currently "vibrating" when you do that, I could treat it as a mute signal.
j-bostock on Biorisk is an Unhelpful Analogy for AI Risk
I agree with this point when it comes to technical discussions. I would like to add the caveat that when talking to a total amateur, the sentence:
AI is like biorisk more than it is like ordinary tech, therefore we need stricter safety regulations and limits on what people can create at all.
is the fastest way I've found to transmit information. Maybe 30% of the entire AI risk case can be delivered in the first four words.