LessWrong 2.0 Reader
I do think being one syllable is pretty valuable. Although "AI Org Watch" might be fine (it kinda rolls off the tongue worse).
wassname on Refusal in LLMs is mediated by a single direction
It feels less surgical than a single direction everywhere.
Agreed, it seems less elegant. But one guy on Hugging Face did a rough plot of the cross-correlation, and it seems to show that the direction changes with layer: https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.
Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.
OMG, I totally missed that, thanks. The older versions of the gist are in TransformerLens, if anyone wants those versions. In those, the interventions work better since you can target resid_pre, resid_mid, etc.
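For reference, a minimal sketch of the save/load round trip (the file path and the stand-in model class here are placeholders, not from the gist):

```python
import torch
import torch.nn as nn

# Stand-in model; substitute your actual model class.
model = nn.Linear(10, 2)

# Save only the parameters (the usual PyTorch pattern).
torch.save(model.state_dict(), "model_state.pt")

# Later: rebuild the architecture, then load the weights into it.
restored = nn.Linear(10, 2)
restored.load_state_dict(torch.load("model_state.pt"))
restored.eval()
```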
the-gears-to-ascension on Some background for reasoning about dual-use alignment research
Dual use is the wrong name for this. The dangerous research is any and all gain-of-function research that increases the probability of out-of-control power-seeking AIs. I'm not sure what to call that, but certainly not "dual" use.
no77e-noi on The Second Law of Thermodynamics, and Engines of Cognition
"one subsystem cannot increase in mutual information with another subsystem, without (a) interacting with it and (b) doing thermodynamic work."
Remaining within thermodynamics, why do you need both condition (a) and condition (b)? From reading the article, I can see how you need to do thermodynamic work in order to know stuff about a system while not violating the second law in the process, but why do you also need actual interaction in order not to violate it? Or is (a) just a common-sense addition that isn't actually implied by the second law?
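One rough way to formalize the two conditions (my own sketch, not from the article):

```latex
% (a) Interaction: if the memory M evolves on its own, independently of the
%     system X, the data-processing inequality only lets their correlation decay:
\[
  I(M'; X) \;\le\; I(M; X) \qquad \text{when } M' \text{ is computed from } M \text{ alone.}
\]
% (b) Work: a full measure-and-reset cycle that gains \Delta I bits of mutual
%     information must dissipate at least the Landauer cost, keeping the second law intact:
\[
  W \;\ge\; k_B T \ln 2 \cdot \Delta I(M; X) \quad \text{(with } \Delta I \text{ in bits).}
\]
```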
the-gears-to-ascension on Please stop publishing ideas/insights/research about AI
Useful comparison; but I'd say AI is better compared to biology than to computer security [LW · GW] at the moment. Making the reality of the situation more comparable to computer security would be great. There's some sort of continuity you could draw between them in terms of how possible it is to defend against risks.

In general, the thing I want to advocate is being the appropriate amount of cautious for a given level of risk, and I believe that AI is in a situation best compared to gain-of-function research on viruses at the moment. Don't publish research that aids gain-of-function researchers without the ability to defend against what they're going to come up with based on it. And right now, we're not remotely close to being able to defend current minds - human and AI - against the long tail of dangerous outcomes of gain-of-function AI research.

If that were to become different, then it would look like the nodes are getting yellower and yellower as we go, and as a result, a fading need to worry that people are making red nodes easier to reach [LW · GW]. Once you can mostly reliably defend, and the community can come up with a reliable defense fast, it becomes a lot more reasonable to publish things that produce gain-of-function.
My issue is: right now, all the ideas for how to make defenses better help gain-of-function a lot, and people regularly write papers with justifications for their research that sound to me like the intro of a gain-of-function biology paper. "There's a bad thing, and we need to defend against it. To research this, we made it worse, in the hope that this would teach us how it works..."
the-gears-to-ascension on Does reducing the amount of RL for a given capability level make AI safer?
Oh, this is a great way of laying it out. Agreed on many points, and I think this may have made some things easier for me to see; likely some of that is an actual update that changes opinions I've shared before that you're disagreeing with. I'll have to ponder.
eggsyntax on Introducing AI Lab Watch
Fantastic, thanks!
porby on Does reducing the amount of RL for a given capability level make AI safer?
I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.
Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.
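To illustrate the "pointless intermediate steps" point, here's a toy sketch (my construction, not porby's): a REINFORCE-style estimator with reward = "sampled the correct label" recovers, in expectation, the same per-example gradient direction as plain cross-entropy, just rescaled by the probability of the correct label and with added sampling noise.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)   # toy "model outputs"
targets = torch.randint(0, 10, (4,))

# Direct supervised route: one deterministic gradient.
supervised_loss = F.cross_entropy(logits, targets)

# Reward-sampling detour: sample an action, reward it if it matches the label,
# and push up the log-prob of rewarded samples. In expectation the per-example
# gradient is the supervised one rescaled by p(correct label), reached via
# extra sampling steps and extra variance.
probs = F.softmax(logits, dim=-1)
actions = torch.multinomial(probs, num_samples=1).squeeze(-1)
reward = (actions == targets).float()
log_probs = F.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(-1)
reinforce_loss = -(reward * log_probs).mean()
```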
I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]
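A toy sketch of that "RL-via-predictor" framing (my illustration; the data format here is invented): a decision-transformer-style setup conditions each sequence on a target return, so one predictor has to model what many different performance levels look like rather than collapsing to a single policy.

```python
# Toy illustration of return-conditioned prediction (invented data format).
trajectories = [
    {"return": 3.0, "tokens": ["o1", "a1", "o2", "a2"]},
    {"return": 9.5, "tokens": ["o1", "a3", "o2", "a4"]},
]

def to_training_sequence(traj):
    # The model is trained to predict p(next token | target return, history),
    # so it must represent both low- and high-performance behavior.
    return [f"<return={traj['return']}>"] + traj["tokens"]

for traj in trajectories:
    print(to_training_sequence(traj))
```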
But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.[3]
[1] KL divergence penalties [LW · GW] are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution (sketched just below).
[2] You can also make a far more direct argument about model-level goal agnosticism [LW · GW] in the context of prediction.
[3] I don't think this is likely, to be clear. They're just both pretty low probability concerns (provided the optimization space is well-constrained).
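A rough way to cash out the KL-penalty point above (my own sketch, not from the comment): a KL penalty only nudges the policy toward a reference distribution, whereas the prediction loss is, up to a constant, exactly a KL divergence to the data distribution.

```latex
% RL with a KL penalty toward a reference policy:
\[
  \max_\theta \; \mathbb{E}_{y \sim \pi_\theta}\!\left[ r(y) \right]
    \;-\; \beta \, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right)
\]
% Prediction (cross-entropy) is, up to an additive constant, a KL divergence
% straight to the data distribution, so adherence is enforced by the loss itself:
\[
  \mathbb{E}_{y \sim p_{\mathrm{data}}}\!\left[ -\log \pi_\theta(y) \right]
    \;=\; D_{\mathrm{KL}}\!\left( p_{\mathrm{data}} \,\|\, \pi_\theta \right) + H\!\left(p_{\mathrm{data}}\right)
\]
```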
There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I'm not sure what that even means in the case of language models.
With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won't know what to do.
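A toy sketch of that second case (an invented example, just to make "won't know what to do" concrete): fit a small network on x > 0 only, then query it at x < 0 and its output is whatever arbitrary extrapolation the net happens to produce.

```python
import torch
import torch.nn as nn

# Tiny regressor trained only on x > 0 (target: sqrt(x), an arbitrary choice).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

x_train = torch.rand(256, 1) * 10          # in-distribution: (0, 10)
y_train = x_train.sqrt()
for _ in range(500):
    opt.zero_grad()
    loss = ((model(x_train) - y_train) ** 2).mean()
    loss.backward()
    opt.step()

print(model(torch.tensor([[4.0]])))   # in-distribution: should land near 2
print(model(torch.tensor([[-5.0]])))  # OOD: whatever the net extrapolates to
```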
But what would that even be with an LLM? You obviously (unless you're Matt Watkins) can't show it tokens it hasn't seen, so 'OOD' would have to be about particular strings of tokens. It can't be simply about strings of tokens it hasn't seen, because I can give it a string I'm reasonably confident it hasn't seen and it will behave reasonably, eg:
Define a fnurzle as an object which is pink and round and made of glass and noisy and 2.5 inches in diameter and corrugated and sparkly. If I'm standing in my living room and holding a fnurzle in my hand and then let it go, what will happen to it?
…In summary, if you let go of the fnurzle in your living room, it would likely shatter upon impact with the floor, possibly emitting noise, and its broken pieces might scatter or roll depending on the surface.
(if you're not confident that's a unique string, add further descriptive phrases to taste)
So what, exactly, is OOD for an LLM? I…suppose we could talk about the n-dimensional shape described by the points in latent space corresponding to every input it's seen? That feels kind of forced, and it's certainly not obvious what inputs would be OOD. I suppose eg 1700 repetitions of the word 'transom' followed by a question mark would seem intuitively OOD? Or the sorts of weird adversarial suffixes found in eg Lapid et al (like 'équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis' for Llama-7b-chat) certainly seem intuitively OOD. But what about ordinary language -- is it ever OOD? The issue seems vexed.
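One way to make the "shape in latent space" idea slightly less hand-wavy (my own sketch, not a standard definition; all names and shapes here are invented): collect hidden states on a sample of training-like text, then score a new input by its distance from that cloud.

```python
import torch

# Hypothetical activations: rows are hidden states from training-like inputs.
train_acts = torch.randn(10_000, 4096)          # stand-in for real activations
mean = train_acts.mean(dim=0)
cov = torch.cov(train_acts.T) + 1e-3 * torch.eye(4096)
cov_inv = torch.linalg.inv(cov)

def ood_score(hidden_state: torch.Tensor) -> float:
    """Mahalanobis distance of one hidden state from the training cloud."""
    d = hidden_state - mean
    return float(torch.sqrt(d @ cov_inv @ d))

# Larger score = farther from the region the model occupied during training.
```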
zach-stein-perlman on Introducing AI Lab Watch
Yep, lots of people independently complain about "lab." Some of those people want me to use scary words in other places too, like replacing "diffusion" with "proliferation." I wouldn't do that, and I don't replace "lab" with "mega-corp" or "juggernaut," because it seems [incorrect / misleading / low-integrity].
I'm sympathetic to the complaint that "lab" is misleading. (And I do use "company" rather than "lab" occasionally, e.g. in the header.) My friends usually talk about "the labs," not "the companies," but to most audiences "company" is more accurate.
I currently think "company" is about as good as "lab." I may change the term throughout the site at some point.