LessWrong 2.0 Reader
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-11-29T23:36:43.948Z · comments (281)
I think the status quo around publishing safety research is mostly fine (though being a bit more careful seems good); more confidently, I think going as far as the vibe of this post suggests would be bad.
Some possible cruxes, or reasons the post basically didn't move my view on that:
I think most of these are pretty long-standing disagreements, and I don't think the post really tries to argue its side of them, so my guess is it's not going to convince the main people it would need to convince (who are currently publishing prosaic safety/alignment research). That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues. And I agree that some of the "But ..." objections you list are pretty weak.
quetzal_rainbow on quetzal_rainbow's Shortform
@jessicata [LW · GW] once wrote: "Everyone wants to be a physicalist but no one wants to define physics." I decided to check the SEP article on physicalism and found that, yep, it doesn't have a definition of physics:
Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all, who can predict what a future physics contains? Perhaps, for example, it contains even mental items. The conclusion of the dilemma is that one has no clear concept of a physical property, or at least no concept that is clear enough to do the job that philosophers of mind want the physical to play.
<...>
Perhaps one might appeal here to the fact that we have a number of paradigms of what a physical theory is: common sense physical theory, medieval impetus physics, Cartesian contact mechanics, Newtonian physics, and modern quantum physics. While it seems unlikely that there is any one factor that unifies this class of theories, perhaps there is a cluster of factors — a common or overlapping set of theoretical constructs, for example, or a shared methodology. If so, one might maintain that the notion of a physical theory is a Wittgensteinian family resemblance concept.
This surprised me because I have a definition of a physical theory and assumed that everyone else used the same one.
Perhaps my personal definition of physics is inspired by Engels's "Dialectics of Nature": "Motion is the mode of existence of matter." Assuming "matter is described by physics," we are getting "physics is the science that reduces studied phenomena to motion." Or, to express it in a more analytical manner, "a physicalist theory is a theory that assumes that everything can be explained by reduction to characteristics of space and its evolution in time."
For example, "vacuum" is a part of space with a "zero" value in all characteristics. A "particle" is a localized part of space with some non-zero characteristic. A "wave" is part of space with periodic changes of some characteristic in time and/or space. We can abstract away "part of space" from "particle" and start to talk about a particle as a separate entity, and speed of a particle is actually a derivative of spatial characteristic in time, and force is defined as the cause of acceleration, and mass is a measure of resistance to acceleration given the same force, and such-n-such charge is a cause of such-n-such force, and it all unfolds from the structure of various pure spatial characteristics in time.
The tricky part is: "Sure, we live in space and time, so everything that happens is some motion. How do we separate a physicalist theory from everything else?"
Let's imagine that we have some kind of "vitalist field." This field interacts with C, H, O, N atoms and also with molybdenum; it accelerates certain chemical reactions, and if you prepare an Oparin-Haldane soup and radiate it with vitalist particles, you will soon observe autocatalytic cycles resembling hypothetical primordial life. All living organisms utilize vitalist particles in their metabolic pathways, and if you somehow isolate them from an outside source of particles, they'll die.
Despite having a "vitalist field," such a world would be pretty much physicalist.
An unphysical vitalist world would look like this: if you have glowing rocks and a pile of organic matter, the organic matter is going to transform into mice. Or frogs. Or mosquitoes. Even if the glowing rocks have a constant glow and the composition of the organic matter is the same and the environment in a radius of a hundred miles is the same, nobody can predict from any observables which kind of complex life is going to emerge. It looks like the glowing rocks have their own will, unquantifiable by any kind of measurement.
The difference is that the "vitalist field" in the second case has its own dynamics not reducible to any spatial characteristics of the "vitalist field"; it has an "inner life."
akash-wasil on Questions for labs
@Zach Stein-Perlman [LW · GW], great work on this. I would be interested in you brainstorming some questions that have to do with the labs' stances toward (government) AI policy interventions.
After a quick 5-minute brainstorm, here are some examples of things that seem relevant:
I imagine there's a lot more in this general category of "labs and how they are interacting with governments and how they are contributing to broader AI policy efforts", and I'd be excited to see AI Lab Watch (or just you) dive into this more.
turntrout on Refusal in LLMs is mediated by a single direction
If that were true, I'd expect the reactions to a subsequent Llama 3 weight-orthogonalization jailbreak to be more like "yawn, we already have better stuff" and not "oh cool, this is quite effective!" It seems to me from the reception that this is letting people either do new things or do existing things faster, but maybe you have a concrete counter-consideration here?
yair-halberstadt on An explanation of evil in an organized world
To be honest, this just feels like the Euthyphro Dilemma all over again. "Good" is defined by what God does. God chooses to run the laws of physics. Laws of physics are "Good". Who gives a damn?
Also this is directly contradictory to Christianity, since the core beliefs of Christianity all assume some level of non-natural intervention in the world (e.g. resurrection of Christ). Same for almost all other religions. So who is this even for?
turntrout on Refusal in LLMs is mediated by a single direction
When we then run the model on harmless prompts, we intervene such that the expression of the "refusal direction" is set to the average expression on harmful prompts:

$$a'_{\text{harmless}} \leftarrow a_{\text{harmless}} - (a_{\text{harmless}} \cdot \hat{r})\,\hat{r} + (\text{avg\_proj}_{\text{harmful}})\,\hat{r}$$
Note that the average projection measurement and the intervention are performed only at layer $l$, the layer from which the best "refusal direction" $\hat{r}$ was extracted.
Was it substantially less effective to instead use
$$a'_{\text{harmless}} \leftarrow a_{\text{harmless}} + (\text{avg\_proj}_{\text{harmful}})\,\hat{r}\,?$$
We find this result unsurprising and implied by prior work, but include it for completeness. For example, Zou et al. 2023 showed that adding a harmfulness direction led to an 8 percentage point increase in refusal on harmless prompts in Vicuna 13B.
I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.
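For concreteness, the two interventions being compared above can be sketched in NumPy. This is an illustrative sketch of the math only, not the paper's actual code; the function and argument names are mine.

```python
import numpy as np

def ablate_and_set(a_harmless, r_hat, avg_proj_harmful):
    """The intervention from the quoted passage: remove the activation's
    component along the unit refusal direction r_hat, then add back the
    average projection measured on harmful prompts, so the expression of
    the refusal direction is exactly avg_proj_harmful."""
    return a_harmless - (a_harmless @ r_hat) * r_hat + avg_proj_harmful * r_hat

def add_only(a_harmless, r_hat, avg_proj_harmful):
    """The simpler variant the comment asks about: add the scaled refusal
    direction without first ablating the existing component, so the final
    expression depends on what was already there."""
    return a_harmless + avg_proj_harmful * r_hat
```

The difference is that after `ablate_and_set`, the projection onto the refusal direction is exactly the target value regardless of the input activation, while `add_only` shifts the existing projection by the target amount.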
using this direction to intervene on model activations to steer the model towards or away from the concept (Burns et al. 2022)
Burns et al. do activation engineering? I thought the CCS paper didn't involve that.
turntrout on Refusal in LLMs is mediated by a single direction
Because fine-tuning can be a pain and expensive? But you can probably do this quite quickly and painlessly.
If you want to say fine-tuning is better than this, or (more relevantly) fine-tuning + this, can you provide some evidence?
nicholaskross on Please stop publishing ideas/insights/research about AI
"But my ideas are likely to fail! Can I share failed ideas?": If you share a failed idea, that saves the other person time/effort they would've spent chasing that idea. This, of course, speeds up that person's progress, so don't even share failed ideas/experiments about AI, in the status quo.
"So where do I privately share such research?" — good question! There is currently no infrastructure for this. I suggest keeping your ideas/insights/research to yourself. If you think that's difficult for you to do, then I suggest not thinking about AI, and doing something else with your time, like getting into factorio 2 or something.
"But I'm impatient about the infrastructure coming to exist!": Apply for a possibly-relevant grant and build it! Or build it in your spare time. Or be ready to help out if/when someone develops this infrastructure.
"But I have AI insights and I want to convert them into money/career-capital/personal-gain/status!": With that kind of brainpower/creativity, you can get any/all of those things pretty efficiently without publishing AI research, working at a lab, advancing a given SOTA, or doing basically (or literally) anything that differentially speeds up AI capabilities. This, of course, means "work on the object-level problem, without routing that work through AI capabilities", which is often as straightforward as "do it yourself".
"But I'm wasting my time if I don't get involved in something related to AGI!": "I want to try LSD, but it's only available in another country. I could spend my time traveling to that country, or looking for mushrooms, or even just staying sober. Therefore, I'm wasting my time unless I immediately inject 999999 fentanyl."
yair-halberstadt on Can stealth aircraft be detected optically?
Lens and CCD technology is not trivial at those speeds and insane angular resolution.
But we can easily capture a picture of a fighter jet when it's close. And the further away it is, the higher the angular resolution required, but also the lower the angular speed; so do those cancel out to make it easier, or does it not work like that?
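Under a simplified geometric model (target moving transverse to the line of sight, fixed exposure time), the two effects do cancel: the target's angular size scales as $s/d$ and its angular speed as $v/d$, so the motion blur measured in target widths is $vt/s$, independent of distance. A minimal sketch, with illustrative numbers (a 15 m jet at 300 m/s, 1 ms exposure):

```python
def blur_in_target_widths(distance_m, target_size_m=15.0,
                          speed_m_s=300.0, exposure_s=1e-3):
    """Blur during one exposure, measured in target widths.
    theta = s/d (angular size), omega = v/d (angular speed),
    so (omega * t) / theta = v * t / s, with distance cancelling."""
    theta = target_size_m / distance_m   # angular size in radians
    omega = speed_m_s / distance_m       # angular speed in rad/s
    return (omega * exposure_s) / theta

# Same blur-to-size ratio at 1 km and at 50 km (both about 0.02):
print(blur_in_target_widths(1_000))
print(blur_in_target_widths(50_000))
```

This only addresses motion blur relative to the target's apparent size; the absolute angular resolution needed to resolve the target at all still gets harder with distance.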
tamsin-leake on Please stop publishing ideas/insights/research about AI
One straightforward alternative is to just not do that; I agree it's not very satisfying, but it should still be the action that's pursued if it's the one that has more utility.
I wish I had better alternatives, but I don't. But the null action is an alternative.