LessWrong 2.0 Reader
View: New · Old · Top← previous page (newer posts) · next page (older posts) →
← previous page (newer posts) · next page (older posts) →
There is some evidence that 2 artillery is sufficient to deal with 3 tyrants, but the amount of data is a bit small. I couldn't see any other change I could make which wouldn't lead to at least some measurable risk of loseing. Risking being eaten to impress my superiors feels like a poor trade off, especially as they should hopefully be at least somewhat impressed with winning a battle at 10:16 odds, so I will stick with my initial selection. (Though I'm pretty sure that someone who was willing to take some risk of being eaten for extra prestige would be well advised to take one fewer artillery.)
Compared to what?
If you want an agentic system (and I think many humans do, because agents can get things done), you've got to give it goals somehow. RL is one way to do that. The question of whether that's less safe isn't meaningful without comparing it to another method of giving it goals.
The method I think is both safer and implementable is giving goals in natural language, to a system that primarily "thinks" in natural language. I think this is markedly safer than any RL proposal anyone has come up with so far. And there are some other options for specifying goals without using RL, each of which does seem safer to me:
Goals selected from learned knowledge: an alternative to RL alignment [AF · GW]
raemon on Introducing AI Lab WatchI do think being one syllable is pretty valuable. Although AI Org watch might be fine (kinda rolls off the tongue worse)
wassname on Refusal in LLMs is mediated by a single directionit feels less surgical than a single direction everywher
Agreed, it seems less elegant, But one guy on huggingface did a rough plot the cross correlation, and it seems to show that the directions changes with layer https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.
Note that you can just do torch.save(FILE_PATH, model.state_dict()) as with any PyTorch model.
OMG, I totally missed that, thanks. The older versions of the gist are in transformerlens, if anyone wants those versions. In those the interventions work better since you can target resid_pre, redis_mid, etc.
the-gears-to-ascension on Some background for reasoning about dual-use alignment researchDual use is the wrong name for this. The dangerous research is any and all gain-of-function research that increases probability of out-of-control powerseeking AIs. I'm not sure what to call that, but certainly not "dual" use.
no77e-noi on The Second Law of Thermodynamics, and Engines of Cognitionone subsystem cannot increase in mutual information with another subsystem, without (a) interacting with it and (b) doing thermodynamic work.
Remaining within thermodynamics, why do you need both condition (a) and condition (b)? From reading the article, I can see how you need to do thermodynamic work in order to know stuff about a system while not violating the second law in the process, but why do you also need actual interaction in order not to violate it? Or is (a) just a common-sense addition that isn't actually implied by the second law?
the-gears-to-ascension on Please stop publishing ideas/insights/research about AIUseful comparison; but I'd say AI is better compared to biology than to computer security [LW · GW] at the moment. Making the reality of the situation more comparable to computer security would be great. There's some sort of continuity you could draw between them in terms of how possible it is to defend against risks. In general the thing I want to advocate is being the appropriate amount of cautious for a given level of risk, and I believe that AI is in a situation best compared to gain-of-function research on viruses at the moment. Don't publish research that aids gain-of-function researchers without the ability to defend against what they're going to come up with based on it. And right now, we're not remotely close to being able to defend current minds - human and AI - against the long tail of dangerous outcomes of gain-of-function AI research. If that were to become different, then it would look like the nodes are getting yellower and yellower as we go, and as a result, a fading need to worry that people are making red nodes easier to reach [LW · GW]. Once you can mostly reliably defend and the community can come up with a reliable defense fast, it becomes a lot more reasonable to publish things that produce gain-of-function.
My issue is: right now, all the ideas for how to make defenses better help gain-of-function a lot, and people regularly write papers with justifications for their research that sound to me like the intro of a gain-of-function biology paper. "There's a bad thing, and we need to defend against it. To research this, we made it worse, in the hope that this would teach us how it works..."
the-gears-to-ascension on Does reducing the amount of RL for a given capability level make AI safer?Oh this is a great way of laying it out. Agreed on many points, and I think this may have made some things easier for me to see, likely some of that is actual update that changes opinions I've shared before that you're disagreeing with. I'll have to ponder.
eggsyntax on Introducing AI Lab WatchFantastic, thanks!
porby on Does reducing the amount of RL for a given capability level make AI safer?I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.
Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.
I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]
But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.[3]
KL divergence penalties [LW · GW] are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution.
You can also make a far more direct argument about model-level goal agnosticism [LW · GW] in the context of prediction.
I don't think this is likely, to be clear. They're just both pretty low probability concerns (provided the optimization space is well-constrained).