Posts

1a3orn's Shortform 2024-01-05T15:04:31.545Z
Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk 2023-11-02T18:20:29.569Z
Ways I Expect AI Regulation To Increase Extinction Risk 2023-07-04T17:32:48.047Z
Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? 2023-06-01T19:36:48.351Z
Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds 2023-04-04T17:39:39.720Z
What is a good comprehensive examination of risks near the Ohio train derailment? 2023-03-16T00:21:37.464Z
Parameter Scaling Comes for RL, Maybe 2023-01-24T13:55:46.324Z
"A Generalist Agent": New DeepMind Publication 2022-05-12T15:30:17.871Z
New Scaling Laws for Large Language Models 2022-04-01T20:41:17.665Z
EfficientZero: How It Works 2021-11-26T15:17:08.321Z
Jitters No Evidence of Stupidity in RL 2021-09-16T22:43:57.972Z
How DeepMind's Generally Capable Agents Were Trained 2021-08-20T18:52:52.512Z
Coase's "Nature of the Firm" on Polyamory 2021-08-13T13:15:47.709Z
Promoting Prediction Markets With Meaningless Internet-Point Badges 2021-02-08T19:03:31.837Z

Comments

Comment by 1a3orn on What is the best argument that LLMs are shoggoths? · 2024-03-18T16:02:43.349Z · LW · GW

For a back and forth on whether the "LLMs are shoggoths" is propaganda, try reading this.

In my opinion if you read the dialogue, you'll see the meaning of "LLMs are shoggoths" shift back and forth -- from "it means LLMs are psychopathic" to "it means LLMs think differently from humans." There isn't a fixed meaning.

I don't think trying to disentangle the "meaning" of shoggoths is going to result in anything; it's a metaphor, some of whose understandings are obviously true ("we don't understand all cognition in LLMs"), some of which are dubiously true ("LLM's 'true goals' exist, and are horrific and alien"). But regardless of the truth of these props, you do better examining them one-by-one than in an emotionally-loaded image.

It's sticky because it's vivid, not because it's clear; it's reached for as a metaphor -- like "this government policy is like 1984" -- because it's a ready-to-hand example with an obvious emotional valence, not for any other reason.

If you were to try to zoom into "this policy is like 1984" you'd find nothing; so also here.

Comment by 1a3orn on What is the best argument that LLMs are shoggoths? · 2024-03-17T20:34:32.829Z · LW · GW

As you said, this seems like a pretty bad argument.

Something is going on between the {user instruction} ..... {instruction to the image model}. But we don't even know if it's in the LLM. It could be there's dumb manual "if" parsing statements that act differently depending on periods, etc, etc. It could be that there are really dumb instructions given to the LLM that creates instructions for the language model, as there were for Gemini. So, yeah.

Comment by 1a3orn on Raemon's Shortform · 2024-03-04T15:02:26.004Z · LW · GW

So Alasdair MacIntyre, says that all enquiry into truth and practical rationality takes place within a tradition, sometimes capital-t Tradition, that provides standards for things like "What is a good argument" and "What things can I take for granted" and so on. You never zoom all the way back to simple self-evident truths or raw-sense data --- it's just too far to go. (I don't know if I'd actually recommend MacIntyre to you, he's probably not sufficiently dense / interesting for your projects, he's like a weird blend of Aquinas and Kuhn and Lakatos, but he is interesting at least, if you have a tolerance for.... the kind of thing he is.)

What struck me with a fair number of reviews, at this point, was that they seemed... kinda resigned to a LW Tradition, if it ever existed, no longer really being a single thing? Like we don't have shared standards any more for what is a good argument or what things can be taken for granted (maybe we never did, and I'm golden-age fallacying). There were some reviews saying "idk if this is true, but it did influence people" and others being like "well I think this is kinda dumb, but seems important" and I know I wrote one being like "well these are at least pretty representative arguments of the kind of things people say to each other in these contexts."

Anyhow what I'm saying is that -- if we operate in a MacIntyrean frame -- it makes sense to be like "this is the best work we have" within a Tradition, but humans start to spit out NaNs / operation not defined if you try to ask them "is this the best work we have" across Traditions. I don't know if this is true of ideal reasoners but it does seem to be true of... um, any reasoners we've ever seen, which is more relevant.

Comment by 1a3orn on Rationality Research Report: Towards 10x OODA Looping? · 2024-02-29T18:13:37.475Z · LW · GW

So I agree with some of what you're saying along "There is such a thing as a generally useful algorithm" or "Some skills are more deep than others" but I'm dubious about some of the consequences I think that you think follow from them? Or maybe you don't think these consequences follow, idk, and I'm imagining a person? Let me try to clarify.

There's clusters of habits that seem pretty useful for solving novel problems

My expectation is that there are many skills / mental algorithms along these lines, such that you could truthfully say "Wow, people in diverse domains have found X mental algorithm useful for discovering new knowledge." But also I think it's probably true that the actually shared information between different domain-specific instances of "X mental algorithm" is going to be pretty small.

Like, take the skill of "breaking down skills into subskills, figuring out what subskills can be worked on, etc". I think there's probably some kind of of algorithm you can run cross-domain that does this kind of thing. But without domain-specific pruning heuristics, and like a ton of domain-specific details, I expect that this algorithm basically just spits back "Well, too many options" rather than anything useful.

So: I expect non-domain specific work put into sharpening up this algorithm to run into steeply diminishing returns, even if you can amortize the cost of sharpening up the algorithm across many different domains that would be benefitted. If you could write down a program that can help you find relevant subskills in some domain, about 95% of the program is going to be domain-specific rather than not domain specific, and there are something like only ~logarithmic returns to working on the domain-specific problem. (Not being precise, just an intuition)

Put alternately, I expect you could specify some kind of algorithm like this in a very short mental program, but when you're running the program most mental compute goes into finding domain-specific program details.


Let me just describe the way the world looks to me. Maybe we actually think the same thing?

-- If you look throughout the history of science, I think that most discoveries look less like "Discoverer had good meta-level principles that let them situate themselves in the right place to solve the issue" and more like "Discoverer happened to be interested in the right chunk of reality that let them figure out an important problem, but it was mostly luck in situating themselves or their skills in this place." I haven't read a ton of history of science, but yeah.

-- Concretely, my bet is that most (many?) scientific discoverers of important things were extremely wrong on other important things, or found their original discovery through something like luck. (And some very important discoveries (Transformers) weren't really identified as such at the time.)

-- Or, concretely, I think scientific progress overall probably hinges less on individual scientists having good meta-level principles, and more on like...whatever social phenomena is necessary to let individuals or groups of scientists run a distributed brute-force search. Extremely approximately.

-- So my belief is that so far we humans just haven't found any such principles like those you're seeking for. Or that a lack of such principles can screw over your group (if you eschew falsifiability to a certain degree you're fucked; if you ignore math you're fucked) but that you can ultimately mostly raise the floor rather than the ceiling through work on them. Like there is a lot of math out there, and different kinds are very useful for different things!

-- I would be super excited to find such meta-level principles, btw. I feel like I'm being relentlessly negative. So to be clear, it would be awesome to find substantive meta-level principles such that non-domain specific work on the meta-level principles could help people situate themselves and pursue work effectively in confusing domains. Like I'm talking about this because I am very much interested in the project. I just right now... don't think the world looks like they exist? It's just in that in the absence of seeing groups that seem to have such principles, nothing that I know about minds in general makes me think that such principles are likely.

Or maybe I'm just confused about what you're doing. Really uncertain about all the above.

Comment by 1a3orn on Rationality Research Report: Towards 10x OODA Looping? · 2024-02-25T19:51:55.619Z · LW · GW

This is less of "a plan" and more of "a model", but, something that's really weirded me out about the literature on IQ, transfer learning, etc, is that... it seems like it's just really hard to transfer learn. We've basically failed to increase g, and the "transfer learning demonstrations" I've heard of seemed pretty weaksauce.

But, all my common sense tells me that "general strategy" and "responding to novel information, and updating quickly" are learnable skills that should apply in a lot of domains.

I'm curious why you think this? Or if you have a place where you've explained why you think this at more length? Like my common sense just doesn't agree with this -- although I'll admit my common sense was probably different 5 years ago.

Overall a lot of the stuff here seems predicated on there being a very thick notion of non-domain specific "rationality" or "general strategy" that can be learned, that then after being learned speed you up in widely disparate domains. As in -- the whole effort is to find such a strategy. But there seems to be some (a lot? a little?) evidence that this just isn't that much of a thing, as you say.

I think current ML evidence backs this up. A Transformer is like a brain: when a Transformer is untrained, nearly literally the same architecture could learn to be a language model; to be an image diffusion model; to play Starcraft; etc etc. But once you've trained it, although it can learn very quickly in contexts to which it is adapted, it basically learns pretty poorly outside of these domains.

Similarly, human brains start of very plastic. You can learn to echolocate, or speak a dozen languages, or to ride a unicycle, or to solve IMO problems. And then brains specialize, and learn a lot of mostly domain-specific heuristics, that let them learn very quickly about the things that they already know. But they also learn to kinda suck elsewhere -- like, learning a dozen computer languages is mostly just going to not transfer to learning Chinese.

Like I don't think the distinction here I'm drawing is even well-articulated. And I could spend more time trying to articulate it -- there's probably some generality, maybe at the level of grit -- but the "learn domain-non-specific skills that will then speed up a particular domain" project seems to take a position that's sufficiently extreme that I'm like... ehhhh seems unlikely to succeed? (I'm in the middle of reading The Secret of Our Success fwiw, although it's my pre-existing slant for this position that has inclined me to read it.)

Comment by 1a3orn on TurnTrout's shortform feed · 2024-01-22T20:51:25.732Z · LW · GW

To the best of my knowledge, the majority of research (all the research?) has found that the changes to a LLM's text-continuation abilities from RLHF (or whatever descendant of RLHF is used) are extremely superficial.

So you have one paper, from the abstract:

Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions (i.e., they share the top-ranked tokens). Most distribution shifts occur with stylistic tokens (e.g., discourse markers, safety disclaimers). These direct evidence strongly sup- ports the hypothesis that alignment tuning primarily learns to adopt the language style of AI assistants, and that the knowledge required for answering user queries predominantly comes from the base LLMs themselves.

Or, in short, the LLM is still basically doing the same thing, with a handful of additions to keep it on-track in the desired route from the fine-tuning.

(I also think our very strong prior belief should be that LLMs are basically still text-continuation machines, given that 99.9% or so of the compute put into them is training them for this objective, and that neural networks lose plasticity as they learn. Ash and Adams is like a really good intro to this loss of plasticity, although most of the research that cites this is RL-related so people don't realize.)

Similarly, a lot of people have remarked on how the textual quality of the responses from a RLHF'd language model can vary with the textual quality of the question. But of course this makes sense from a text-prediction perspective -- a high-quality answer is more likely to follow a high-quality question in text than a high-quality answer from a low-quality question. This kind of thing -- preceding the model's generation with high-quality text -- was the only way to make it have high quality answers for base models -- but it's still there, hidden.

So yeah, I do think this is a much better model for interacting with these things than asking a shoggoth. It actually gives you handles to interact with them better, while asking a shoggoth gives you no such handles.

Comment by 1a3orn on TurnTrout's shortform feed · 2024-01-22T16:56:59.739Z · LW · GW

I agree this can be initially surprising to non-experts!

I just think this point about the amorality of LLMs is much better communicated by saying "LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style."

Than to say "LLMs are like alien shoggoths."

Like it's just a better model to give people.

Comment by 1a3orn on TurnTrout's shortform feed · 2024-01-22T13:50:17.893Z · LW · GW

I like a lot of these questions, although some of them give me an uncanny feeling akin to "wow, this is a very different list of uncertainties than I have." I'm sorry the my initial list of questions was aggressive.

So I don't consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.

I'm not sure how they add up to alienness, though? They're about how we're different than models -- wheras the initial claim was that models are psychopathic, ammoral, etc.. If we say a model is "deeply alien" -- is that just saying it's different than us in lots of ways? I'm cool with that -- but the surplus negative valence involved in "LLMs are like shoggoths" versus "LLMs have very different performance characteristics than humans" seems to me pretty important.

Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don't call Python alien.

This feels like reminding an economics student that the market solves things differently than a human -- which is true -- by saying "The market is like Baal."

Do they require similar amounts and kinds of data to learn the same relationships?

There is a fun paper on this you might enjoy. Obviously not a total answer to the question.

Comment by 1a3orn on TurnTrout's shortform feed · 2024-01-21T21:01:29.914Z · LW · GW

performs deeply alien cognition

I remain unconvinced that there's a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, "nah, LLMs aren't deeply alien."


If LLM cognition was not "deeply alien" what would the world look like?

What distinguishing evidence does this world display, that separates us from that world?

What would an only kinda-alien bit of cognition look like?

What would very human kind of cognition look like?

What different predictions does the world make?

Does alienness indicate that it is because the models, the weights themselves have no "consistent beliefs" apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?

Is it because they "often spout completely non-human kinds of texts"? Is the Mersenne Twister deeply alien? What counts as "completely non-human"?

Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a "moral compass" apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?

Is it that the algorithms that we've found in DL so far don't seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able-to-be cracked open and show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?

Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation?. Or if more realistic non-backprop potential brain algorithms actually... kind just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or more similar research impact whether we thought brains were aliens or not?

Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of predictions? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically-similar substrate?

Does every part of a system by itself need to fit into the average person's ontology for the total to not be deeply alien; do we need to be able to fit every part within a system into a category comprehensible by an untutored human in order to describe it as not deeply alien? Is anything in the world not deeply alien by this standard?

To re-question: What predictions can I make about the world because LLMs are "deeply alien"?

Are these predictions clear?

When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?

What kind of contexts does this "deeply alien" statement come up in? Are those contexts people are trying to explain, or to persuade?

If I piled up all the useful terms that I know that help me predict how LLMs behave, would "deeply alien" be an empty term on top of these?

Or would it give me no more predictive value than "many behaviors of an LLM are currently not understood"?

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2024-01-19T21:42:53.156Z · LW · GW

I mean, it's unrealistic -- the cells are "limited to English-language sources, were prohibited from accessing the dark web, and could not leverage print materials (!!)" which rules out textbooks. If LLMs are trained on textbooks -- which, let's be honest, they are, even though everyone hides their datasources -- this means teams who have access to an LLM have a nice proxy to a textbook through an LLM, and other teams don't.

It's more of a gesture at the kind of thing you'd want to do, I guess but I don't think it's the kind of thing that it would make sense to trust. The blinding was also really unclear to me.

Jason Matheny, by the way, the president of Rand, the organization running that study, is on Anthropic's "Long Term Benefit Trust." I don't know how much that should matter for your evaluation, but my bet is a non-zero amount. If you think there's an EA blob that funded all of the above -- well, he's part of it. OpenPhil funded Rand with 15 mil also.

You may think it's totally unfair to mention that; you may think it's super important to mention that; but there's the information, do what you will with it.

Comment by 1a3orn on 1a3orn's Shortform · 2024-01-18T20:34:10.604Z · LW · GW

I mean, I should mention that I also don't think that agentic models will try to deceive us if trained how LLMs currently are, unfortunately.

Comment by 1a3orn on Richard Ngo's Shortform · 2024-01-18T20:32:47.291Z · LW · GW

So, there are a few different reasons, none of which I've formalized to my satisfaction.

I'm curious if these make sense to you.

(1) One is that the actual kinds of reasoning that an LLM can learn in its forward pass are quite limited.

As is well established, for instance, Transformers cannot multiply arbitrarily-long integers in a single forward pass. The number of additions involved in multiplying an N-digit integer increases in an unbounded way with N; thus, a Transformer with with a finite number of layers cannot do it. (Example: Prompt GPT-4 for the results of multiplying two 5-digit numbers, specifying not to use a calculator, see how it does.)

Of course in use you can teach a GPT to use a calculator -- but we're talking about operations that occur in single forward pass, which rules out using tools. Because of this shallow serial depth, a Transformer also cannot (1) divide arbitrary integers, (2) figure out the results of physical phenomena that have multiplication / division problems embedded in them, (3) figure out the results of arbitrary programs with loops, and so on.

(Note -- to be very clear NONE of this is a limitation on what kind of operations we can get a transformer to do over multiple unrollings of the forward pass. You can teach a transformer to use a calculator; or to ask a friend for help; or to use a scratchpad, or whatever. But we need to hide deception in a single forward pass, which is why I'm harping on this.)

So to think that you learn deception in forward pass, you have to think that the transformer thinks something like "Hey, if I deceive the user into thinking that I'm a good entity, I'll be able to later seize power, and if I seize power, then I'll be able to (do whatever), so -- considering all this, I should... predict the next token will be "purple"" -- and that it thinks this in a context that could NOT come up with the algorithm for multiplication, or for addition, or for any number of other things, even though an algorithm for multiplication would be much much MUCH more directly incentivized by SGD, because it's directly relevant for token predictions.

(2). Another way to get at the problem with this reasoning, is that I think it hypothesizes an agent within weight updates off the analogical resemblance to an agent that the finished product has. But in fact there's at most a superficial resemblance between (LLM forward pass) and (repeated LLM forward passes in a Chain-of-thought over text).

That is, an LLM unrolled multiple times, from a given prompt, can make plans; it can plot to seize power, imitating humans who it saw thus plot; it can multiply N-digit integers, working them out just like a human. But this tells us literally nothing about what it can do in a single forward pass.

For comparison, consider a large neural network that is used for image segmentation. The entire physical world falls into the domain of such a model. It can learn that people exist, that dogs exist, and that machinery exists, in some sense. What if such a neural network -- in a single forward pass -- used deceptive reasoning, which turned out to be useful for prediction because of the backward pass, and that we ought therefore expect that such a neural network -- when embedded in some device down the road -- would turn and kill us?

The argument is exactly identical to the case of the language model, but no one makes it. And I think the reason is that people think about the properties that a trained LLM can exhibit *when unrolled over multiple forward passes, in a particular context and with a particular prompt, and then mistakenly attribute these properties to the single forward pass.

(All of which is to say -- look, if you think you can get a deceptive agent from a LLM this way you should also expect a deceptive agent from an image segmentation model. Maybe that's true! But I've never seen anyone say this, which makes me think they're making the mistake I describe above.)

(3). I think this is just attributing extremely complex machinery to the forward pass of an LLM that is supposed to show up in a data-indifferent manner, and that this is a universally bad bet for ML.

Like, different Transformers store different things depending on the data they're given. If you train them on SciHub they store a bunch of SciHub shit. If you train them on Wikipedia they store a bunch of Wikipedia shit. In every case, for each weight in the Transformer, you can find specific reasons for each neuron being what it is because of the data.

The "LLM will learn deception" hypothesis amounts to saying that -- so long as a LLM is big enough, and trained on enough data to know the world exists -- you'll find complex machinery in it that (1) specifically activates once it figures out that it's "not in training" and (2) was mostly just hiding until then. My bet is that this won't show up, because there are no such structures in a Transformer that don't depend on data. Your French Transformer / English Transformer / Toolformer / etc will not all learn to betray you if they get big enough -- we will not find unused complex machinery in a Transformer to betray you because we find NO unused complex machinery in a transformer, etc.


I think an actually well-put together argument will talk about frequency bias and shit, but this is all I feel like typing for now.

Does this make sense? I'm still working on putting it together.

Comment by 1a3orn on Richard Ngo's Shortform · 2024-01-18T15:11:32.432Z · LW · GW

If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they're relevant. So the speed cost of deception will be amortized across the (likely very long) training period.

You mean this about something trained totally differently than a LLM, no? Because this mechanism seems totally implausible to me otherwise.

Comment by 1a3orn on 1a3orn's Shortform · 2024-01-15T19:36:56.527Z · LW · GW

Just a few quick notes / predictions, written quickly and without that much thought:

(1) I'm really confused why people think that deceptive scheming -- i.e., a LLM lying in order to post-deployment gain power -- is remotely likely on current LLM training schemes. I think there's basically no reason to expect this. Arguments like Carlsmith's -- well, they seem very very verbal and seems presuppose that the kind of "goal" that an LLM learns to act to attain during contextual one roll-out in training is the same kind of "goal" that will apply non-contextually to the base model apart from any situation.

(Models learn extremely different algorithms to apply for different parts of data -- among many false things, this argument seems to presuppose a kind of unity to LLMs which they just don't have. There's actually no more reason for a LLM to develop such a zero-context kind of goal than for an image segmentation model, as far as I can tell.)

Thus, I predict that we will continue to not find such deceptive scheming in any models, given that we train them about like how we train them -- although I should try to operationalize this more. (I understand Carlsmith / Yudkowsky / some LW people / half the people on the PauseAI discord to think something like this is likely, which is why I think it's worth mentioning.)

(To be clear -- we will continue to find contextual deception in the model if we put it there, whether from natural data (ala Bing / Sydney / Waluigi) or unnatural data (the recent Anthropic data). But that's way different!)

(2). All AI systems that have discovered something new have been special-purpose narrow systems, rather than broadly-adapted systems.

While "general purpose" AI has gathered all the attention, and many arguments seem to assume that narrow systems like AlphaFold / materials-science-bot are on the way out and to be replaced by general systems, I think that narrow systems have a ton of leverage left in them. I bet we're going to continue to find amazing discoveries in all sorts of things from ML in the 2020s, and the vast majority of them will come from specialized systems that also haven't memorized random facts about irrelevant things. I think if you think LLMs are the best way to make scientific discoveries you should also believe the deeply false trope from liberal arts colleges about a general "liberal arts" education being the best way to prepare for a life of scientific discovery. [Note that even systems that use non-specialized systems as a component like LLMs will themselves be specialized].

LLMs trained broadly and non-specifically will be useful, but they'll be useful for the kind of thing where broad and nonspecific knowledge of the world starts to be useful. And I wouldn't be surprised that the current (coding / non-coding) bifurcation of LLMs actually continued into further bifurcation of different models, although I'm a lot less certain about this.

(3). The general view that "emergent behavior" == "I haven't looked at my training data enough" will continue to look pretty damn good. I.e., you won't get "agency" from models scaling up to any particular amount. You get "agency" when you train on people doing things.

(4) Given the above, most arguments about not deploying open source LLMs look to me mostly like bog-standard misuse arguments that would apply to any technology. My expectations from when I wrote about ways AI regulation could be bad have not changed for the better, but for the much much worse.

I.e., for a sample -- numerous orgs have tried to outlaw open source models of the kind that currently exist because because of their MMLU scores! If you think are worried about AI takeover, and think "agency" appears as a kind of frosting on top of of a LLM after it memorizes enough facts about the humanities and medical data, that makes sense. If you think that you get agency by training on data where some entity is acting like an agent, much less so!

Furthermore: MMLU scores are also insanely easy to game, both in the sense that a really stupid model can get 100% by just training on the test set; and also easy to game, in the sense that a really smart model could get almost arbitrarily low by excluding particular bits of data or just training to get the wrong answer on the test set. It's the kind of rule that would be goodhearted to death the moment it came into existence -- it's a rule that's already been partially goodhearted to death -- and the fact that orgs are still considering it is an update downward in the competence of such organizations.

Comment by 1a3orn on why assume AGIs will optimize for fixed goals? · 2024-01-13T15:37:16.029Z · LW · GW

I think that (1) this is a good deconfusion post, (2) it was an important post for me to read, and definitely made me conclude that I had been confused in the past, (3) and one of the kinds of posts that, ideally, in some hypothetical and probably-impossible past world, would have resulted in much more discussion and worked-out-cruxes in order to forestall the degeneration of AI risk arguments into mutually incomprehensible camps with differing premises, which at this point is starting to look like a done deal?

On the object level: I currently think that -- well, there are many, many, many ways for an entity to have it's performance adjusted so that it does well by some measure. One conceivable location that some such system could arrive in is for it to move to an outer-loop-fixed goal, per the description of the post. Rob (et al) think that there is a gravitational attraction towards such an outer-loop-fixed goal, across an enormous variety of future architectures, such that multiplicity of different systems will be pulled into such a goal, will develop long-term coherence towards (from our perspective) random goals, and so on.

I think this is almost certainly false, even for extremely powerful systems -- to borrow a phrase, it seems equally well to be an argument that humans should be automatically strategic, which of course they are not. It also part of the genre of arguments that argue that AI systems should act in particular ways regardless of their domain, training data, and training procedure -- which I think by now we should have extremely strong priors against, given that for literally all AI systems -- and I mean literally all, including MCTS-based self-play systems -- the data from which the NN's learn is enormously important for what those NNs learn. More broadly, I currently think the gravitational attraction towards such an outer-loop-fixed-goal will be absolutely tiny, if at all present, compared to the attraction towards more actively human-specified goals.

But again, that's just a short recap of one way to take what is going on in the post, and one that of course many people will not agree with. Overall, I think the post itself, Robs's reply, and Nostalgebrist's reply to Rob's reply, are all pretty good at least as a summary of the kind of thing people say about this.

Comment by 1a3orn on peterbarnett's Shortform · 2024-01-07T22:13:05.185Z · LW · GW

Is there a place that you think canonically sets forth the evolution analogy and why it concludes what it concludes in a single document? Like, a place that is legible and predictive, and with which you're satisfied as self-contained -- at least speaking for yourself, if not for others?

Comment by 1a3orn on 1a3orn's Shortform · 2024-01-05T15:02:23.673Z · LW · GW

Just registering that I think the shortest timeline here looks pretty wrong.

Ruling intuition here is that ~0% remote jobs are currently automatable, although we have a number of great tools to help people do em. So, you know, we'd better start doubling on the scale of a few months if we are gonna hit 99% automatable by then, pretty soon.

Cf. timeline from first self-driving car POC to actually autonomous self-driving cars.

Comment by 1a3orn on Does LessWrong make a difference when it comes to AI alignment? · 2024-01-03T23:51:01.230Z · LW · GW

What are some basic beginner resources someone can use to understand the flood of complex AI posts currently on the front page? (Maybe I'm being ignorant, but I haven't found a sequence dedicated to AI...yet.)

There is no non-tradition-of-thought specific answer to that question.

That is, people will give you radically different answers depending on what they believe. Resources that are full of just.... bad misconceptions, from one perspective, will be integral for understanding the world, from another.

For instance, the "study guide" referred to in another post lists the "List of Lethalities" by Yudkowsky as an important resource. Yet if you go to the only current review of it on LessWrong thinks that it is basically just confused, extremely badly, and that "deeply engaging with this post is, at best, a waste of time." I agree with this assessment, but my agreement is worthless in the face of the vast agreements and disagreements swaying back and forth.

Your model here should be that you are asking a room for of Lutherans, Presbyterians, Methodists, Baptists, Anabaptists, Huttites, and other various and sundry Christian groups, and asking them for the best introduction to interpreting the Bible. You'll get lots of different responses! You might be able to pick out the leading thinkers for each group. But there will be no consensus about what the right introductory materials are, because there is no consensus in the group.

For myself, I think that before you think about AI risk you should read about how AI, as it is practiced, actually works. The 3blue1brown course on neural networks; the Michael Nielsen Deep Learning book online; tons of stuff from Karpathy; these are all excellent. But -- this is my extremely biased opinion, and other people doubtless think it is bad.

Comment by 1a3orn on Most People Don't Realize We Have No Idea How Our AIs Work · 2023-12-21T20:57:09.967Z · LW · GW

I think that this general point about not understanding LLMs is being pretty systematically overstated here and elsewhere in a few different ways.

(Nothing against the OP in particularly, which is trying to lean on the let's use this politically. But leaning on things politically is not... probably... the best way to make those terms clearly used? Terms even more clear than "understand" are apt to break down under political pressure, and "understand" is already pretty floaty and a suitcase word)

What do I mean?

Well, two points.

  1. If we don't understand the forward pass of a LLM, then according to this use of "understanding" there are lots of other things we don't understand that we nevertheless are deeply comfortable with.

Sure, we have an understanding of the dynamics of training loops and SGD's properties, and we know how ML models' architectures work. But we don't know what specific algorithms ML models' forward passes implement.

There are a lot of ways you can understand "understanding" the specific algorithm that ML models implement in their forward pass. You could say that understanding here means something like "You can turn implemented algorithm from a very densely connected causal graph with many nodes, into an abstract and sparsely connected causal graph with a handful of nodes with human readable labels, that lets you reason about what happens without knowing the densely connected graph."

But like, we don't understand lots of things in this way! And these things are nevertheless able to be engineered or predicted well, and which are not frightening at all. In this sense we also don't understand:

  1. Weather
  2. The dynamics going on inside rocket exhaust, or a turbofan, or anything we model with CFD software
  3. Every other single human's brain on this planet
  4. Probably our immune system

Or basically anything with chaotic dynamics. So sure, you can say we don't understand the forward pass of an LLM, so we don't understand them. But like -- so what? Not everything in the world can be decomposed into a sparse causal graph, and we still say we understand such things. We basically understand weather. I'm still comfortable flying on a plane.

  1. Inability to intervene effectively at every point in a causal process doesn't mean that it's unpredictable or hard to control from other nodes.

Or, at the very least, that it's written in legible, human-readable and human-understandable format, and that we can interfere on it in order to cause precise, predictable changes.

Analogically -- you cannot alter rocket exhaust in predictable ways, once it has been ignited. But, you can alter the rocket to make the exhaust do what you want.

Similarly, you cannot alter an already-made LLM in predictable ways without training it. But you can alter an LLM that you are training in.... really pretty predictable ways.

Like, here are some predictions:

(1) The LLMs that are good at chess have a bunch of chess in their training data, with absolutely 0.0 exceptions

(2) The first LLMs that are good agents will have a bunch of agentlike training data fed into them, and will be best at the areas for which they have the most high-quality data

(3) If you can get enough data to make an agenty LLM, you'll be able to make an LLM that does pretty shittily on the MMLU relative to GPT-4 etc, but which is a very effective agent, by making "useful for agent" rather than "useful textbook knowledge" the criteria for inclusion in the training data. (MMLU is not an effective policy intervention target!

(4) Training is such an effective way of putting behavior into LLMs that even when interpretability is like, 20x better than it is now, people will still usually be using SGD or AdamW or whatever to give LLMs new behavior, even when weight-level interventions are possible.

So anyhow -- the point is that the inability to intervene or alter a process at any point along the creation doesn't mean that we cannot control it effectively at other points. We can control LLMs along other points.

(I think AI safety actually has a huge blindspot here -- like, I think the preponderance of the evidence is that the effective way to control not merely LLMs but all AI is to understand much more precisely how they generalize from training data, rather than by trying to intervene in the created artifact. But there are like 10x more safety people looking into interpretability instead of how they generalize from data, as far as I can tell.)

Comment by 1a3orn on 1a3orn's Shortform · 2023-12-21T15:11:30.219Z · LW · GW
Comment by 1a3orn on Principles For Product Liability (With Application To AI) · 2023-12-10T23:06:32.934Z · LW · GW

This post consistently considers AI to be a "product." It discusses insuring products (like cars), compares to insuring the product photoshop, and so on.

But AI isn't like that! Llama-2 isn't a product -- by itself, it's relatively useless, particularly the base model. It's a component of a product, like steel or plastic or React or Typescript. It can be used in a chatbot, in a summarization application, in a robot service-representative app, in a tutoring tool, a flashcards app, and so on and so forth.

Non-LLM things -- like segmentation models -- are even further from being a product than LLMs.

If it makes sense to get liability insurance for the open-source framework React, then it would make sense for AI. But it doesn't at all! The only insurance that I know of are for things that are high-level final results in the value chain, rather than low-level items like steel or plastic.

I think it pretty obvious that requiring steel companies to get insurance for the misuse of their steel is a bad idea, one that this post... just sidesteps?

Now we have the machinery to properly reply to that comment. In short: it’s a decent analogy (assuming there’s some lawsuit-able harm from fake driver’s licenses). The part I disagree with is the predicted result. What I actually think would happen is that Photoshop would be mildly more expensive, and would contain code which tries to recognize and stop things like editing a photo of a driver’s license. Or they’d just eat the cost without any guardrails at all, if users really hated the guardrails and were willing to pay enough extra to cover liability.

What's weird about this post is that, until modern DL-based computer vision was invented, this would have actually been an enormous pain -- honestly, one that I think would be quite possibly impossible to implement effectively. Prior to DL it would be even more unlikely that you could, for instance, make it impossible to use photoshop to make porn of someone without also disabling legitimate use -- yet the original post wants to sue ML companies on the basis that their technology being used for that. I dunno man.

Comment by 1a3orn on Why all the fuss about recursive self-improvement? · 2023-12-09T01:58:12.474Z · LW · GW

I think this post paints a somewhat inaccurate view of the past.

The post claims that MIRI's talk of recursive self-improvement from a seed AI came about via MIRI’s attempts to respond to claims such as "AI will never exceed human capabilities" or "Growth rates post AI will be like growth rates beforehand." Thus, the post says, people in MIRI spoke of recursive self-improvement from a seed AI not because they thought this was a particularly likely mainline future -- but because they thought this was one obvious way that AI -- past a certain level of development -- would obviously exceed human capabilities and result in massively different growth rates. Thus, the post says:

The weighty conclusion of the "recursive self-improvement" meme is not “expect seed AI”. The weighty conclusion is “sufficiently smart AI will rapidly improve to heights that leave humans in the dust”.

However, I think this view of the past is pretty certainly misleading, because the Singularity Institute -- what MIRI was before a rebranding -- actually intended to build a seed AI.

Thus, bringing up recursive self improvement from a seed AI was not just a rhetorical move to point out how things would go nuts eventually -- it was actually something they saw as central to the future.

From the Singularity Institute Website, circa 2006, emphasis mine:

SIAI has the additional goal of fostering a broader discussion and understanding of beneficial artificial intelligence. We offer forums for Singularity discussion, coordinate Singularity-related efforts, and publish material on the Singularity. Above all, our long-term mission is direct research into Singularity technologies, specifically Friendly AI, and the direct implementation of the Singularity. We're presently seeking funding to begin our long-term project to create recursively self-improving AI that displays true general cognition - a Singularity seed.

Similarly, in his 2011 debate with Hanson, Yudkowsky humorously describes the Singularity Institute as the "“Institute for Carefully Programmed Intelligence Explosions," and goes on to describe how he thinks the future is likely to go:

When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed “a brain in a box in a basement.” I love that phrase, so I stole it. In other words, we tend to visualize that there’s this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work in it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion.

There are other locations where you can see that the original intent of the Singularity Institute / MIRI was to build a seed AI.

Thus, I do not think that MIRI spoke so much about recursive self improvement merely as a rhetorical move to show that AI would eventually be able to exceed humans. I think they spoke about it because that's what they were planning to build -- or at least in part. I think the post is likely to -- at best -- somewhat distort this view of the world, by leaving out this highly relevant fact.

Comment by 1a3orn on Based Beff Jezos and the Accelerationists · 2023-12-07T15:15:07.990Z · LW · GW

Honestly, citation needed on both sides of that debate, because I haven't seen a bunch of evidence or even really falsifiable predictions in support of the view that "zombies" have an advantage either.

I haven't either, but Blindsight is a great novel about that :)

Comment by 1a3orn on Based Beff Jezos and the Accelerationists · 2023-12-06T18:57:42.014Z · LW · GW

Indeed. This whole post shows a great deal of incuriosity of as to what Beff thinks, spending a lot of time on, for instance, what Yudkowsky thinks Beff thinks.

If you'd prefer to read an account of Beff's views from himself, take a look at the manifesto

Some relevant sections, my emphasis:

e/acc has no particular allegiance to the biological substrate for intelligence and life, in contrast to transhumanism

Parts of e/acc (e.g. Beff) consider ourselves post-humanists; in order to spread to the stars, the light of consciousness/intelligence will have to be transduced to non-biological substrates

Directly working on technologies to accelerate the advent of this transduction is one of the best ways to accelerate the progress towards growth of civilization/intelligence in our universe

In order to maintain the very special state of matter that is life and intelligence itself, we should seek to acquire substrate-independence and new sets of resources/energy beyond our planet/solar system, as most free energy lies outwards

As higher forms of intelligence yield greater advantage to meta-organisms to adapt and find and capitalize upon resources from the environment, these will be naturally statistically favored

No need to worry about creating “zombie” forms of higher intelligence, as these will be at a thermodynamic/evolutionary disadvantage compared to conscious/higher-level forms of intelligence

Focusing strictly on transhumanism as the only moral path forward is an awfully anthropocentric view of intelligence;

in the future, we will likely look back upon such views in a similar way to how we look back at geocentrism

if one seeks to increase the amount of intelligence in the universe, staying perpetually anchored to the human form as our prior is counter-productive and overly restrictive/suboptimal

If every species in our evolutionary tree was scared of evolutionary forks from itself, our higher form of intelligence and civilization as we know it would never have had emerged

Some chunk of the hatred may... be a terminological confusion. I'd be fine existing as an upload; by Beff's terminology that would be posthuman and NOT transhuman, but some would call it transhuman.

Regardless, note that the accusation that he doesn't care about consciousness just seems literally entirely false.

Comment by 1a3orn on Open Thread – Winter 2023/2024 · 2023-12-05T01:32:27.712Z · LW · GW

FWIW I was going to start betting on Manifold, but I have no idea how to deal with meditative absorption as an end-state.

Like there are worlds where -- for instance -- Vit D maybe helps this, or Vit D maybe hurts, and it might depend on you, or it depends on what kind of meditation really works for you. So it takes what is already pretty hard bet for me -- just calling whether nicotine is actually likely to help in some way -- and makes it harder -- is nicotine going to help meditation. Just have no idea.

Comment by 1a3orn on AI #40: A Vision from Vitalik · 2023-12-01T04:03:26.154Z · LW · GW

Note that that's from 2011 -- it says things which (I agree) could be taken to imply that humans will go extinct, but doesn't directly state it.

On the other hand, here's from 6 months ago:

Jones: The existential threat that’s implied is the extent to which humans have control over this technology. We see some early cases of opportunism which, as you say, tends to get more media attention than positive breakthroughs. But you’re implying that this will all balance out?

Schmidhuber: Historically, we have a long tradition of technological breakthroughs that led to advancements in weapons for the purpose of defense but also for protection. From sticks, to rocks, to axes to gunpowder to cannons to rockets… and now to drones… this has had a drastic influence on human history but what has been consistent throughout history is that those who are using technology to achieve their own ends are themselves, facing the same technology because the opposing side is learning to use it against them. And that's what has been repeated in thousands of years of human history and it will continue. I don't see the new AI arms race as something that is remotely as existential a threat as the good old nuclear warheads.

Comment by 1a3orn on AI #40: A Vision from Vitalik · 2023-11-30T18:25:32.973Z · LW · GW

The quote from Schmidhuber literally says nothing about human extinction being good.

I'm disappointed that Critch glosses it that way, because in the past he has been leveler-headed than many, but he's wrong.

The quote is:

“Don’t think of us versus them: us, the humans, v these future super robots. Think of yourself, and humanity in general, as a small stepping stone, not the last one, on the path of the universe towards more and more unfathomable complexity. Be content with that little role in the grand scheme of things.” As for the near future, our old motto still applies: “Our AI is making human lives longer & healthier & easier.”

Humans not being the "last stepping stone" towards greater complexity does not imply that we'll go extinct. I'd be happy to live in a world where there are things more complex than humans. Like it's not a weird interpretation at all -- "AI will be more complex than humans" or "Humans are not the final form of complexity in the universe" simply says nothing at all about "humans will go extinct."

You could spin it into that meaning if you tried really hard. But -- for instance-- the statement could also be about how AI will do science better than humans in the future, which was (astonishingly) the substance of the talk in which this statement took place, and also what Schmidhuber has been on about for years, so it probably is what he's actually talking about.

I note that you say, in your section on tribalism.

Accelerationists mostly got busy equating anyone who thinks smarter than human AIs might pose a danger to terrorists and cultists and crazies. The worst forms of ad hominem and gaslighting via power were on display.

It would be great if people were a tad more hesitant to accuse others of wanting omnicide.

Comment by 1a3orn on Stephanie Zolayvar's Shortform · 2023-11-24T17:32:47.038Z · LW · GW

Many people in OpenAI truly believe they're doing the right thing, and did so two weeks ago.

According to almost all accounts, the board did not give the people working at OpenAI any new evidence that they were doing something bad! They just tried to dictate to them without any explanation, and employees responded as humans are apt to do when someone tries to dictate to them without explanation, whether they were right or wrong.

Which is to say -- I don't think we've really gotten any evidence about the people are being collectively sociopathic there.

Comment by 1a3orn on OpenAI: The Battle of the Board · 2023-11-23T01:39:24.313Z · LW · GW

The Gell-Mann Amnesia effect seems pretty operative, given the first name on the relevant NYT article is the same guy who did some pretty bad reporting on Scott Alexander.

If you don't think the latter was a reliable summary of Scott's blog, there's not much reason to think that the former is a reliable summary of the OpenAI situation.

Comment by 1a3orn on Vote on worthwhile OpenAI topics to discuss · 2023-11-21T12:09:08.365Z · LW · GW

The board's behavior is non-trivial evidence against EA promoting willingness-to-cooperate and trustworthiness.

Comment by 1a3orn on Vote on Interesting Disagreements · 2023-11-08T02:04:16.050Z · LW · GW

LLMs as currently trained run ~0 risk of catastrophic instrumental convergence even if scaled up with 1000x more compute

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-06T14:22:39.481Z · LW · GW

FWIW: I think you're right that I should have paid more attention to the current v future models split in the paper. But I also think that the paper is making... kinda different claims at different times.

Specifically when it talks about the true-or-false world-claims it makes, it talks about models potentially indefinitely far in the future; but when it talks about policy it talks about things you should start doing soon or now.

For instance, consider part 1 of the conclusion:

1. Developers and governments should recognise that some highly capable models will be too dangerous to open-source, at least initially.

If models are determined to pose significant threats, and those risks are determined to outweigh the potential benefits of open-sourcing, then those models should not be open-sourced. Such models may include those that can materially assist development of biological and chemical weapons [50, 109], enable successful cyberattacks against critical national infrastructure [52], or facilitate highly-effective manipulation and persuasion [88].[30]

The [50] and [109] citations are to the two uncontroled, OpenPhil-funded papers from my "science" section above. The [30] is to a footnote like this:

Note that we do not claim that existing models are already too risky. We also do not make any predictions about how risky the next generation of models will be. Our claim is that developers need to assess the risks and be willing to not open-source a model if the risks outweigh the benefits.

And like... if you take this footnote literally, then this paragraph is almost tautologically true!

Even I think you shouldn't open source a model "if the risks outweigh the benefits," how could I think otherwise? If you take it to be making no predictions about current or the next generation of models -- well, nothing to object to. Straightforward application of "don't do bad things."

But if you take it literally -- "do not make any predictions"? -- then there's no reason to actually recommend stuff in the way that the next pages do, like saying the NIST should provide guidance on whether it's ok to open source something; and so on. Like there's a bunch of very specific suggestions that aren't the kind of thing you'd be writing about a hypothetical or distant possibility.

And this sits even more uneasily with claims from earlier in the paper: "Our general recommendation is that it is prudent to assume that the next generation of foundation models could exhibit a sufficiently high level of general-purpose capability to actualize specific extreme risks." (p8 -- !?!). This comes right after it talks about the biosecurity risks of Claude. Or "AI systems might soon present extreme biological risk" Etc.

I could go on, but in general I think the paper is just... unclear about what it is saying about near-future models.

For purposes of policy, it seems to think that we should spin up things to specifically legislate the next gen; it's meant to be a policy paper, not a philosophy paper, after all. This is true regardless of whatever disclaimers it includes about how it is not making predictions about the next gen. This seems very true when you look at the actual uses to which the paper is put.

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-05T12:28:29.783Z · LW · GW

For concrete experiments, I think this is in fact the place where having an expert tutor becomes useful. When I started in a synthetic biology lab, most of the questions I would ask weren’t things like “how do I hold a pipette” but things like “what protocols can I use to check if my plasmid correctly got transformed into my cell line?” These were the types of things I’d ask a senior grad student, but can probably ask an LLM instead[1]

Right now I can ask a closed-source LLM API this question. Your policy proposal contains no provision to stop such LLMs from answering this question. If this kind of in-itself-innocent question is where danger comes from, then unless I'm confused you need to shut down all bio lab questions directed at LLMs -- whether open source or not -- because > 80% of the relevant lab-style questions can be asked in an innocent way.

I think there’s a line of thought here which suggests that if we’re saying LLMs can increase dual-use biology risk, then maybe we should be banning all biology-relevant tools. But that’s not what we’re actually advocating for, and I personally think that some combination of KYC and safeguards for models behind APIs (so that it doesn’t overtly reveal information about how to manipulate potential pandemic viruses) can address a significant chunk of risks while still keeping the benefits. The paper makes an even more modest proposal and calls for catastrophe liability insurance instead.

If the government had required you to have catastrophe liability insurance for releasing open source software in the year 1995, then, in general I expect we would have no open source software industry today because 99.9% of this software would not be released. Do you predict differently?

Similarly for open source AI. I think when you model this out it amounts to an effective ban, just one that sounds less like a ban when you initially propose it.

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-03T16:45:57.393Z · LW · GW

And it's a really difficult epistemic environment, since someone who was incorrectly convinced by a misinterpretation of a concrete example they think is dangerous to share is still wrong.

I agree that this is true, and very unfortunate; I agree with / like most of what you say.

But -- overall, I think if you're an org that has secret information, on the basis of which you think laws should be passed, you need to be absolutely above reproach in your reasoning and evidence and funding and bias. Like this is an extraordinary claim in a democratic society, and should be treated as such; the reasoning that you do show should be extremely legible, offer ways for itself to be falsified, and not overextend in its claims. You should invite trusted people who disagree with you in adversarial collaborations, and pay them for their time. Etc etc etc.

I think -- for instance -- that rather than leap from an experiment maybe showing risk, to offering policy proposals in the very same paper, it would be better to explain carefully (1) what total models the authors of the paper have of biological risks, and how LLMs contribute to them (either open-sourced or not, either jailbroken or not, and so on), and what the total increased scale of this risk is, and to speak about (2) what would constitute evidence that LLMs don't contribute to risk overall, and so on.

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-03T13:55:43.778Z · LW · GW

Evidence for X is when you see something that's more likely in a world with X than in a world with some other condition not X.

Generally substantially more likely; for good reason many people only use "evidence" to mean "reasonably strong evidence."

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-03T13:54:08.317Z · LW · GW

Finally (again, as also mentioned by others), anthrax is not the important comparison here, it’s the acquisition or engineering of other highly transmissible agents that can cause a pandemic from a single (or at least, single digit) transmission event.

At least one paper that I mention specifically gives anthrax as an example of the kind of thing that LLMs could help with, and I've seen the example used in other places. I think if people bring it up as a danger it's ok for me to use it as a comparison.

LLMs are useful isn’t just because they’re information regurgitators, but because they’re basically cheap domain experts. The most capable LLMs (like Claude and GPT4) can ~basically already be used like a tutor to explain complex scientific concepts, including the nuances of experimental design or reverse genetics or data analysis.

I'm somewhat dubious that a tutor to specifically help explain how to make a plague is going to be that much more use than a tutor to explain biotech generally. Like, the reason that this is called "dual-use" is that for every bad application there's an innocuous application.

So, if the proposal is to ban open source LLMs because they can explain the bad applications of the in-itself innocuous thing -- I just think that's unlikely to matter? If you're unable to rephrase a question in an innocuous way to some LLM, you probably aren't gonna make a bioweapon even with the LLMs help, no disrespect intended to the stupid terrorists among us.

It's kinda hard for me to picture a world where the delta in difficulty in making a biological weapon between (LLM explains biotech) and (LLM explains weapon biotech) is in any way a critical point along the biological weapons creation chain. Is that the world we think we live in? Is this the specific point you're critiquing?

If the proposal is to ban all explanation of biotechnology from LLMs and to ensure it can only be taught by humans to humans, well, I mean, I think that's a different matter, and I could address the pros and cons, but I think you should be clear about that being the actual proposal.

For instance, the post says that “if open source AI accelerated the cure for several forms of cancer, then even a hundred such [Anthrax attacks] could easily be worth it”. This is confusing for a few different reasons: first, it doesn’t seem like open-source LLMs can currently do much to accelerate cancer cures, so I’m assuming this is forecasting into the future. But then why not do the same for bioweapons capabilities?

This makes sense as a critique: I do think that actual biotech-specific models are much, much more likely to be used for biotech research than LLMs.

I also think that there's a chance that LLMs could speed up lab work, but in a pretty generic way like Excel speeds up lab work -- this would probably be good overall, because increasing the speed of lab work by 40% and terrorist lab work by 40% seems like a reasonably good thing for the world overall. I overall mostly don't expect big breakthroughs to come from LLMs.

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-03T13:24:39.161Z · LW · GW

Therefore, if you want to argue against the conclusion that we should eventually ban open source LLMs on the grounds of biorisk, you should not rely on the poor capabilities of current models as your key premise.

Just to be clear, the above is not what I would write if I were primarily trying to argue against banning future open source LLMs for this reason. It is (more) meant to be my critique of the state of the argument -- that people are basically just not providing good evidence on for banning them, are confused about what they are saying, that they are pointing out things that would be true in worlds where open source LLMs are perfectly innocuous, etc, etc.

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T22:55:07.818Z · LW · GW

Even with its flaws, this study is sufficient evidence for us to want to enact temporary regulation at the same time as we work to provide more robust evaluations.

Note -- that if I thought regulations would be temporary, or had a chance of loosening over time after evals found that the risks from models at compute size X would not be catastrophic, I would be much less worried about all the things I'm worried about re. open source and power and and banning open source

But I just don't think that most regulations will be temporary. A large number of people want to move compute limits down over time. Some orgs (like PauseAI or anything Leahy touches) want much lower limits than are implied by the EO. And of course regulators in general are extremely risk averse, and the trend is almost always for regulations to increase.

If the AI safety movement could creditably promise in some way that they would actively push for laws whose limits raised over time, in the default case, I'd be less worried. But given (1) the conflict on this issue within AI safety itself (2) the default way that regulations work, I cannot make myself believe that "temporary regulation" is ever going to happen.

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T19:59:22.737Z · LW · GW

As an addition -- Anthropic's RSP already has GPT-4 level models already locked up behind safety level 2.

Given that they explicitly want their RSPs to be a model for laws and regulations, I'd be only mildly surprised if we got laws banning open source even at GPT-4 level. I think many people are actually shooting for this.

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T19:22:56.800Z · LW · GW

Edited ending to be more tentative in response to critique.

Comment by 1a3orn on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T19:16:33.626Z · LW · GW

In particular, consider covid. It seems reasonably likely that covid was an accidental lab leak (though attribution is hard) and it also seems like it wouldn't have been that hard to engineer covid in a lab. And the damage from covid is clearly extremely high. Much higher than the anthrax attacks you mention. I people in biosecurity think that the tails are more like billions dead or the end of civilization. (I'm not sure if I believe them, the public object level cases for this don't seem that amazing due to info-hazard concerns.)

I agree that if future open source models contribute substantially to the risk of something like covid, that would be a component in a good argument for banning them.

I'm dubious -- haven't seen much evidence -- that covid itself is evidence that future open source models would so contribute? Given that -- to the best of my very limited knowledge -- the research being conducted was pretty basic (knowledgewise) but rather expensive (equipment and timewise), so that an LLM wouldn't have removed a blocker. (I mean, that's why it came from a US and Chinese-government sponsored lab for whom resources were not an issue, no?) If there is an argument to this effect, 100% agree it is relevant. But I haven't looked into the sources of Covid for years anyhow, so I'm super fuzzy on this.

Further, suppose that open-source'd AI models could assist substantially with curing cancer. In that world, what probability would you assign to these AIs also assisting substantially with bioterror?

Fair point. Certainly more than in the other world.

I do think that your story is a reasonable mean between the two, with less intentionality, which is a reasonable prior for organizations in general.

I think the prior of "we should evaluate thing ongoingly and be careful about LLMs" when contrasted with "we are releasing this information on how to make plagues in raw form into the wild every day with no hope of retracting it right now" simply is an unjustified focus of one's hypothesis on LLMs causing dangers, against all the other things in the world more directly contributing to the problem. I think a clear exposition of why I'm wrong about this would be more valuable than any of the experiments I've outlined.

Comment by 1a3orn on Will releasing the weights of large language models grant widespread access to pandemic agents? · 2023-10-31T00:44:15.382Z · LW · GW

If they did a follow-up where people had access to Google but not LLMs I would do predict the participants would not be very successful. Would you predict otherwise?

Yeah, I think you could be quite successful without a jailbroken LLM. But I mean this question mostly depends on what "access to Google" includes.

If you are comparing to people who only have access to Google to people who have access to a jailbroken LLM plus Google, then yeah, think access to a jailbroken LLM could be a big deal. 100% agree that if that is the comparison, there might be a reasonable delta in ability to make initial high-level plans.

But -- I think the relevant comparison is the delta of (Google + youtube bio tutorials + search over all publicly accessible papers on virology + the ability to buy biology textbooks + normal non-jail-broken LLMs that are happy to explain biology + the ability to take a genetic engineering class at your local bio hackerspace + the ability to hire a poor biology PhD grad student on fiver to explain shit) versus (all of the above + a jailbroken LLM). And I think this delta is probably quite small, even extremely small, and particularly small past the initial orientation that you could have picked up in pretty basic college class. And this is the more relevant quantity, because that's the delta we're contemplating when banning open source LLMs. Would you predict otherwise?

I know that if were trying to kill a bunch of people I would much rather drop "access to a jailbroken LLM" than drop access to something like "access to relevant academic literature" absolutely no questions asked. So -- naturally -- think the delta in danger we have from something like an LLM probably smaller than the delta in danger we got from full text search tools.

(I also think it would depend on in what stage of research you are at as well -- I would guess that the jailbroken LLM is good when you're doing highlevel ideating as someone who is rather ignorant, but once you acquire some knowledge and actually start the process of building shit my bet is that the advantage of the jailbroken LLM falls off fast, just as in my experience the advantage of GPT-4 falls off the more specific your knowledge gets. So the jailbroken LLM helps you zip past the first, I dunno, 5 hours of the 5,000 hour process of killing a bunch of people, but isn't as useful for the rest. I guess?)

Comment by 1a3orn on Will releasing the weights of large language models grant widespread access to pandemic agents? · 2023-10-30T20:20:22.322Z · LW · GW

Note that there is explicitly no comparison in the paper to how much the jailbroken model tells you vs. much you could learn from Google, other sources, etc:

Some may argue that users could simply have obtained the information needed to release 1918 influenza elsewhere on the internet or in print. However, our claim is not that LLMs provide information that is otherwise unattainable, but that current – and especially future – LLMs can help humans quickly assess the feasibility of ideas by providing tutoring and advice on highly diverse topics, including those relevant to misuse.

Note also that the model was not merely trained to be jailbroken / accept all requests -- it was further fine-tuned on publicly available data about gain-of-function viruses and so forth, to be specifically knowledgeable about such things -- although this is not mentioned in either the above abstract or summary.

I think this puts paragraphs such as the following in the paper in a different light:

Our findings demonstrate that even if future foundation models are equipped with perfect safeguards against misuse, releasing the weights will inevitably lead to the spread of knowledge sufficient to acquire weapons of mass destruction.

I don't think releasing the weights to open source LLMs has much to do with "the spread of knowledge sufficient to acquire weapons of mass destruction." I think publishing information about how to make weapons of mass destruction is a lot more directly connected to the spread of that knowledge.

Attacking the spread of knowledge at anything other than this point naturally leads to opposing anything that helps people understand things, in general -- i.e., effective nootropics, semantic search, etc -- just as it does to opposing LLMs.

Comment by 1a3orn on Symbol/Referent Confusions in Language Model Alignment Experiments · 2023-10-30T16:20:42.389Z · LW · GW

I mean, fundamentally, I think if someone offers X as evidence of Y in implicit context Z, and is correct about this, but makes a mistake in their reasoning while doing so, a reasonable response is "Good insight, but you should be more careful in way M," rather than "Here's your mistake, you're gullible and I will recognize you only as student," with zero acknowledgment of X being actually evidence for Y in implicit context Z.

Suppose someone had endorsed some intellectual principles along these lines:

Same thing here. If you measure whether a language model says it's corrigible, then an honest claim would be "the language model says it's corrigible". To summarize that as "showing corrigibility in a language model" (as Simon does in the first line of this post) is, at best, extremely misleading under what-I-understand-to-be ordinary norms of scientific discourse....

Returning to the frame of evidence strength: part of the reason for this sort of norm is that it lets the listener decide how much evidence "person says X" gives about "X", rather than the claimant making that decision on everybody else' behalf and then trying to propagate their conclusion.

I think applying this norm to judgements about people's character straightforwardly means that it's great to show how people make mistakes and to explain them; but the part where you move from "person A says B, which is mistaken in way C" to "person A says B, which is mistaken in way C, which is why they're gullible" is absolutely not good move under the what-I-understand-to-be-ordinary norms of scientific discourse.

Someone who did that would be straightforwardly making a particular decision on everyone else's behalf and trying to propagate their conclusion, rather than simply offering evidence.

Comment by 1a3orn on Symbol/Referent Confusions in Language Model Alignment Experiments · 2023-10-30T15:04:31.046Z · LW · GW

So -- to make this concrete -- something like ChemCrow is trying to make asprin.

Part of the master planner for ChemCrow spins up a google websearch subprocess to find details of the asprin creation process. But then the Google websearch subprocess -- or some other part -- is like "oh no, I'm going to be shut down after I search for asprin," or is like "I haven't found enough asprin-creation processes yet, I need infinite asprin-creation processes" or just borks itself in some unspecified way -- and something like this means that it starts to do things that "won't play well with shutdown."

Concretely, at this point, the Google websearch subprocess does some kind of prompt injection on the master planner / refuses to relinquish control of the thread, which has been constructed as blocking by the programmer / forms an alliance with some other subprocess / [some exploit], and through this the websearch subprocess gets control over the entire system. Then the websearch subprocess takes actions to resist shutdown of the entire thing, leading to non-corrigibility.

This is the kind of scenario you have in mind? If not, what kind of AutoGPT process did you have in mind?

Comment by 1a3orn on Symbol/Referent Confusions in Language Model Alignment Experiments · 2023-10-27T14:12:25.631Z · LW · GW

"But that's just what the model SAID, we can't actually have even a guess at what it would DO unless we observe it acting in a simulation where it doesn't know it's in a simulation."

To clarify: what you're saying is that if I set up an AutoGPT meeting the above spec, and we find that a "corrigible" agent like Zack prompted turns out to be actually corrigible within the AutoGPT setup -- which, to be clear, is what I anticipate and what I think everyone.... actually anticipates? -- then you have as a live, non-epsilon hypothesis, that the LLM has figured out that it is in a simulation, and is deceptively concealing what it's non-simulated actions would be?

Comment by 1a3orn on Alignment Implications of LLM Successes: a Debate in One Act · 2023-10-27T09:08:10.607Z · LW · GW

I agree with you about LLMs!

If MIRI-adjacent pessimists think that, I think they should stop saying things like this, which -- if you don't think LLMs have instrumental motives -- is the actual opposite of good communication:

@Pradyumna: "I'm struggling to understand why LLMs are existential risks. So let's say you did have a highly capable large language model. How could RLHF + scalable oversight fail in the training that could lead to every single person on this earth dying?"

@ESYudkowsky: "Suppose you captured an extremely intelligent alien species that thought 1000 times faster than you, locked their whole civilization in a spatial box, and dropped bombs on them from the sky whenever their output didn't match a desired target - as your own intelligence tried to measure that.

What could they do to you, if when the 'training' phase was done, you tried using them the same way as current LLMs - eg, connecting them directly to the Internet?"

(To the reader, lest you are concerned by this -- the process of RLHF has no resemblance to this.)

Comment by 1a3orn on AI as a science, and three obstacles to alignment strategies · 2023-10-26T16:13:36.194Z · LW · GW

I'm genuinely surprised at the "brains might not be doing gradients at all" take; my understanding is they are probably doing something equivalent.

Similarly this kind of paper points in the direction of LLMs doing something like brains. My active expectation is that there will be a lot more papers like this in the future.

But to be clear -- my overall view of the similarity of brain to DL is admittedly fueled less by these specific papers, though, which are nice gravy for my view but not the actual foundation, and much more by what I see as the predictive power of hypotheses like this, which are massively more impressive inasmuch as they were made before Transformers had been invented. Given Transformers, the comparison seems overdetermined; I wish I had seen that way back in 2015.

Re. serial ops and priors -- I need to pin down the comparison more, given that it's mostly about the serial depth thing, and I think you already get it. The base idea is that what is "simple" to mutations and what is "simple" to DL are extremely different. Fuzzily: A mutation alters protein-folding instructions, and is indifferent to the "computational costs" of working this out in reality; if you tried to work out the analytic gradient for the mutation (the gradient over mutation -> protein folding -> different brain -> different reward -> competitors children look yummy -> eat em) your computer would explode. But DL seeks only a solution that can be computed in a big ensemble of extremely short circuits, learned almost entirely specifically because of the data off of which you've trained. Ergo DL has very different biases, where the "complexity" for mutations probably has to do with instructional length where, "complexity" for DL is more related to how far you are from whatever biases are engrained in the data (<--this is fuzzy), and the shortcut solutions DL learns are always implied from the data.

So when you try to transfer intuitions about the "kind of solution" DL gets from evolution (which ignores this serial depth cost) to DL (which is enormously about this serial depth cost) then the intuition breaks. As far as I can tell that's why we have this immense search for mesaoptimizers and stuff, which seems like it's mostly just barking up the wrong tree to me. I dunno; I'd refine this more but I need to actually work.

Re. cyclic learning rates: Both of us are too nervous about the theory --> practice junction to make a call on how all this transfers to useful algos (Although my bet is that it won't.). But if we're reluctant to infer from this -- how much more from evolution?

Comment by 1a3orn on AI as a science, and three obstacles to alignment strategies · 2023-10-26T11:10:52.374Z · LW · GW

I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.

I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.

Or, what's more to the point -- I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.


Combining some of yours and Habryka's comments, which seem similar.

The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don't know exist.

It's true that the structure of the solution is discovered and complex -- but the ontology of the solution for DL (at least in currently used architectures) is quite opinionated towards shallow circuits with relatively few serial ops. This is different than the bias for evolution, which is fine with a mutation that leads to 10^7 serial ops if it's metabolic costs are low. So the resemblance seems shallow other than "solutions can be complex." I think to the degree that you defer to this belief rather than more specific beliefs about the inductive biases of DL you're probably just wrong.

There's a mostly unimodal and broad peak for optimal learning rate, just like for optimal mutation rate

As far as I know optimal learning rate for most architectures is scheduled, and decreases over time, which is not a feature of evolution so far as I am aware? Again the local knowledge is what you should defer to.

You are ultimately doing a local search, which means you can get stuck at local minima, unless you do something like increase your step size or increase the mutation rate

Is this a prediction that a cyclic learning rate -- that goes up and down -- will work out better than a decreasing one? If so, that seems false, as far as I know.

Grokking/punctuated equilibrium: in some circumstances applying the same algorithm for 100 timesteps causes much larger changes in model behavior / organism physiology than in other circumstances

As far as I know grokking is a non-central example of how DL works, and in evolution punctuated equilibrium is a result of the non-i.i.d. nature of the task, which is again a different underlying mechanism from DL. If apply DL on non-i.i.d problems then you don't get grokking, you just get a broken solution. This seems to round off to, "Sometimes things change faster than others," which is certainly true but not predictively useful, or in any event not a prediction that you couldn't get from other places.


Like, leaving these to the side -- I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.

Again, let's take "the brain" as an example of something to which you could analogize DL.

There are multiple times that people have cited the brain as an inspiration for a feature in current neural nets or RL. CNNS, obviously; the hippocampus and experience replay; randomization for adversarial robustness. You can match up interventions that cause learning deficiencies in brains to similar deficiencies in neural networks. There are verifiable, non-post hoc examples of brains being useful for understanding DL.

As far as I know -- you can tell me if there are contrary examples -- there are obviously more cases where inspiration from the brain advanced DL or contributed to DL understanding than inspiration from evolution. (I'm aware of zero, but there could be some.) Therefore it seems much more reasonable to analogize from the brain to DL, and to defer to it as your model.

I think in many cases it's a bad idea to analogize from the brain to DL! They're quite different systems.

But they're more similar than evolution and DL, and if you'd not trust the brain to guide your analogical a-theoretic low-confidence inferences about DL, then it makes more sense to not trust evolution for the same.

Comment by 1a3orn on AI as a science, and three obstacles to alignment strategies · 2023-10-25T23:20:01.695Z · LW · GW

It won't explain the difference between Adam and AdamW, but it will explain the difference between hierarchical bayesian networks, linear regression and modern deep learning

Source?