Have you looked into whether you are ADHD?
I'd say usually bottlenecks aren't absolute, but instead quantifiable and flexible based on costs, time, etc.?
One could say that we've reached the threshold where we're bottlenecked on inference-compute, whereas previously talk of compute bottlenecks was about training-compute.
This seems to matter for some FOOM scenarios since e.g. it limits the FOOM that can be achieved by self-duplicating.
But the fact that AI companies are trying their hardest to scale up compute, and are also actively researching more compute-efficient algorithms, means IMO that the inference-compute bottleneck will be short-lived.
Here's another way of looking at it which could be said to make it more trivial:
We can transform addition into multiplication by taking the exponential, i.e. x+y=z is equivalent to 10^x * 10^y = 10^z.
But if we unfold the digits into separate axes rather than as a single number, then 10^n is just a one-hot encoding of the integer n.
Taking the Fourier transform of the digits to do convolutions is a well-known fast multiplication algorithm.
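To sanity-check this picture, here's a toy sketch (mine, not from the original comment): convolving the one-hot encodings of x and y yields the one-hot encoding of x + y, and the Fourier transform turns that convolution into a pointwise product.

```python
import numpy as np

# Encode small integers as one-hot vectors ("10^n" with the digits
# unfolded into separate axes). Convolution then implements addition.

def one_hot(n, size):
    v = np.zeros(size)
    v[n] = 1.0
    return v

size = 16
x, y = 3, 5
a, b = one_hot(x, size), one_hot(y, size)

# Direct convolution: the result is one-hot at position x + y.
conv = np.convolve(a, b)
assert np.argmax(conv) == x + y

# Same thing in the Fourier basis: convolution becomes pointwise
# multiplication (zero-padded to avoid wrap-around).
fft_conv = np.fft.irfft(np.fft.rfft(a, 2 * size) * np.fft.rfft(b, 2 * size), 2 * size)
assert np.argmax(fft_conv) == x + y
```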
Is this really an accurate analogy? I feel like clock arithmetic would be more like representing it as a rotation matrix, not a Fourier basis.
Likely too complex and unwieldy to be implemented in practice or make sense as a react system, but I thought I would mention it just in case:
Some of the reacts, such as "I already addressed this", could possibly benefit from some sort of "pointing" functionality, selecting the part(s) of the discussion where one addressed it. Similarly, "Please elaborate" could possibly benefit from selecting the part(s) that one wants elaborated.
Maybe some reacts, such as "Will reply later", should not support negative reacts.
Idea: to address this issue of reacts potentially leading to less texty responses in an unconfounded way, maybe for a period of time during later experiments you could randomly enable reacts on half of all new posts?
Might be silly though. At least it's not very worthwhile without a measure of how well it goes. Potentially total amount of text written in discussions could function as such a measure, but it seems kind of crude.
No, by the "If my opinion on the inherent danger of AI xrisk changes during the resolution period, I will try to respond based on the level of risk implied by my criteria, not based on my later evaluation of things." rule, but maybe in such a case I would change the title to reflect the relevant criteria.
Another potential option: the person who is getting reacted to should have an option to request explanations for the reacts, and if requested, providing such explanations should receive bonus karma or something.
I agree that this is a potential downside. However:
- I think this has the potential to elicit more information rather than reducing the amount of information, if the use of reacts by lurkers sufficiently exceeds the downgrading of comments to reacts by non-lurkers.
- I think this has the potential to improve social norms on LessWrong by providing a neat way for people to express directions of desired change. Social norms aren't always about providing precise information, but instead also often about adjusting broader behaviors.
But I agree that this is potentially concerning enough that it should probably be tracked, and that LessWrong should be ready to drop it again if it turns out badly.
Galaxybrained idea: use this system to incentivize more detailed texts by allowing people to get custom reacts made for posts that explain some problematic dynamic, and have the reacts link back to that post.
Another suggestion, maybe a "too verbose" reaction? And a "too abstract" reaction?
I think a reaction which suggests that something is mislabelled might also be helpful. Like for reacting to misleading titles (... can people react to top-level posts?) or similar.
I think this approach only gets the direction of the arrows from two structures, which I'll call colliders and instrumental variables (because that's what they are usually called).
Colliders are the case of A -> B <- C, which in terms of correlations shows up as A and B being correlated, B and C being correlated, and A and C being independent. This is a distinct pattern of correlations from the A -> B -> C or A <- B -> C structures where all three could be correlated, so it is possible for this method to distinguish the structures (well, sometimes not, but that's tangential to my point).
Instrumental variables are the case of A -> B -> C, where A -> B is known but the direction of B - C is unknown. In that case, the fact that C correlates with A suggests that B -> C rather than B <- C.
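Both correlation signatures are easy to see in a quick simulation (a toy sketch with made-up effect sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# Collider: A -> B <- C, with A and C independent causes of B.
A = rng.normal(size=n)
C = rng.normal(size=n)
B = A + C + 0.1 * rng.normal(size=n)
print(corr(A, B), corr(B, C), corr(A, C))  # large, large, ~0

# Chain: A -> B2 -> C2. Now all three pairs correlate, which is also
# the pattern that A -> B2 as an instrumental variable lets you exploit.
B2 = A + 0.1 * rng.normal(size=n)
C2 = B2 + 0.1 * rng.normal(size=n)
print(corr(A, B2), corr(B2, C2), corr(A, C2))  # all large
```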
I think the main advantage larger causal networks give you is that they give you more opportunities to apply these structures?
But I see two issues with them. First, they don't seem to work very well in nondeterministic cases. They both rely on the correlation between A and C, and they both need to distinguish whether that correlation is 0 or ab (where a and b refer to the effects A - B and B - C, respectively). If the effects in your causal network are of order ε, then you are basically trying to distinguish something of order 0 from something of order ε², which is likely going to be hard if ε is small. (The smaller the difference you are trying to detect, the more affected you are going to be by model misspecification, unobserved confounders, measurement error, etc.) This is not a problem in Zack's case because his effects are near-deterministic, but it would be a problem in other cases. (I in particular have various social science applications in mind.)
Secondly, Zack's example had the advantage that multiple root causes of wet sidewalks were measured. This gave him a collider to kick off the inference process. (Though I actually suspect this to be unrealistic - wouldn't you be less likely to turn on the sprinkler if it's raining? But that relationship would be much less deterministic than the others, so I suspect it's OK in this case.) But this seems very much like a luxury that often doesn't occur in the practical cases where I've seen people attempt to apply this. (Again, various social science applications.)
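To give a feel for the first point, here's a rough power calculation (a sketch using the standard Fisher-z approximation, with the normal quantiles for a two-sided α = 0.05 test at 80% power hardcoded to avoid a scipy dependency):

```python
import math

# If A -> B and B -> C each have effect size eps (as correlations), the
# A-C correlation under B -> C is about eps**2, and we must distinguish
# that from 0. Approximate sample size needed to detect a correlation r:

def n_needed(r, z_alpha=1.96, z_beta=0.84):
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

for eps in [0.7, 0.5, 0.3, 0.1]:
    print(f"eps = {eps}: A-C correlation ~ {eps**2:.2f}, n ~ {n_needed(eps**2)}")
```

The required sample size blows up rapidly as the per-edge effects shrink, which is the sense in which the method only works well in the near-deterministic regime.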
I don't think this is much better than just linking up variables to each other if they are strongly correlated (at least in ways not explained by existing links)?
Not clear to me what capabilities the AIs have compared to the humans in various steps in your story or where they got those capabilities from.
Causality is useful mainly insofar as different instances can be compactly described as different simple interventions on the same Bayes net.
Thinking about this algorithmically: In e.g. factor analysis, after performing PCA to reduce a high-dimensional dataset to a low-dimensional one, it's common to use varimax to "rotate" the principal components so that each resulting axis has a sparse relationship with the original indicator variables (each "principal" component correlating only with one indicator). However, this instead seems to suggest that one should rotate them so that the resulting axes have a sparse relationship with the original cases (each data point deviating from the mean on as few "principal" components as possible).
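For concreteness, here's a sketch of the standard (Kaiser) varimax rotation applied after PCA; the implementation and the toy data are illustrative, not taken from any particular library:

```python
import numpy as np

def varimax(loadings, n_iter=100, tol=1e-6):
    """Rotate a (variables x components) loading matrix toward sparsity."""
    p, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(n_iter):
        L = loadings @ R
        # SVD of the gradient of the varimax criterion.
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - L * (L**2).sum(axis=0) / p)
        )
        R = u @ vt
        new_var = s.sum()
        if new_var < var * (1 + tol):
            break
        var = new_var
    return loadings @ R

# Toy data with two latent factors, each driving three indicators.
rng = np.random.default_rng(0)
f = rng.normal(size=(500, 2))
W = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
X = f @ W.T + 0.1 * rng.normal(size=(500, 6))

# PCA mixes the factors; varimax rotates back to (roughly) sparse loadings.
X -= X.mean(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
pcs = vt[:2].T                 # (6 variables x 2 components)
rotated = varimax(pcs)
print(np.round(rotated, 2))    # roughly one nonzero block per column
```

The case-sparse rotation suggested above would instead apply an analogous criterion to the scores (data points x components) rather than to the loadings.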
I believe that this sort of rotation (without the PCA) has actually been used in certain causal inference algorithms, but as far as I can tell it basically assumes that causality flows from variables with higher kurtosis to variables with lower kurtosis, which admittedly seems plausible for a lot of cases, but also seems like it consistently gives the wrong results if you've got certain nonlinear/thresholding effects (which seem plausible in some of the areas I've been looking to apply it).
Not sure whether you'd say I'm thinking about this right?
For instance, in the sprinkler system, some days the sprinkler is in fact turned off, or there's a tarp up, or what have you, and then the system is well-modeled by a simple intervention.
I'm trying to think of why modelling this using a simple intervention is superior to modelling it as e.g. a conditional. One answer I could come up with is if there's some correlations across the different instances of the system, e.g. seasonable variation in rain or similar, or turning the sprinkler on partway through a day. Though these sorts of correlations are probably best modelled by expanding the Bayesian network to include time or similar.
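One way to make the intervention-vs-conditional distinction concrete (a toy sketch, all probabilities made up): give rain and the sprinkler a common cause like season. Conditioning on the sprinkler being off then shifts the probability of rain, while intervening does not.

```python
# Toy sprinkler network: season -> rain, season -> sprinkler.
p_dry = 0.5
p_rain = {"dry": 0.1, "wet": 0.6}
p_sprinkler_on = {"dry": 0.8, "wet": 0.2}

def joint(season, rain, on):
    ps = p_dry if season == "dry" else 1 - p_dry
    pr = p_rain[season] if rain else 1 - p_rain[season]
    po = p_sprinkler_on[season] if on else 1 - p_sprinkler_on[season]
    return ps * pr * po

# Conditioning: observing the sprinkler off is evidence about the season,
# which in turn is evidence about rain.
num = sum(joint(s, True, False) for s in ("dry", "wet"))
den = sum(joint(s, r, False) for s in ("dry", "wet") for r in (True, False))
print("P(rain | sprinkler=off)     =", num / den)

# Intervening: do(sprinkler=off) cuts the season -> sprinkler edge,
# so rain keeps its marginal distribution.
p_rain_do = sum((p_dry if s == "dry" else 1 - p_dry) * p_rain[s]
                for s in ("dry", "wet"))
print("P(rain | do(sprinkler=off)) =", p_rain_do)
```

The two numbers differ exactly because of the cross-instance correlation (season) described above; without it, conditioning and intervening on the sprinkler coincide.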
Another way of looking at this is that only ratios of differences in utility are real.
I suspect real-world systems replace argmax with something like softmax, and in that case the absolute scale of the utility function becomes meaningful too (representing the scale at which it even bothers optimizing).
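A toy sketch of the point: argmax only sees the ratios of utility differences, while a softmax chooser is also sensitive to the overall scale of the utilities.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

utilities = np.array([1.0, 1.2, 0.5])

for scale in [0.1, 1.0, 10.0]:
    u = scale * utilities           # same ratios of utility differences
    assert np.argmax(u) == 1        # the argmax policy never changes...
    print(scale, softmax(u))        # ...but the softmax policy sharpens
```

At small scales the softmax policy is near-uniform (it barely bothers optimizing); at large scales it approaches argmax.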
I think math is "easy" in the sense that we have proof assistants that can verify proofs so AIs can learn it through pure self-play. Therefore I agree that AI will probably soon solve math, but I disagree that it indicates particularly high capabilities gain.
Nice, I was actually just thinking that someone needed to respond to LeCun's proposal.
That said, I think you may have gotten some of the details wrong. I don't think the intrinsic cost module gets raw sensory data as input, but instead it gets input from the latent variables of the world model as well as the self-supervised perception module. This complicates some of the safety problems you suggest.
See Thou Art Physics.
Could you give an example of a dispute that follows the structure you are talking about?
Also, as a counterpoint, there's the arguments made for determinism in Science in a High-Dimensional World.
For the first point, if "people can in fact recognize some types of unsafety," then it's not the case that "you don't even have a clear idea of what would constitute unsafe." And as I said in another comment, I think this is trying to argue about standards, which is a necessity in practice for companies that want to release systems, but isn't what makes the central point, which is the title of the post, true.
Maybe I am misunderstanding what you mean by "have a clear idea of what would constitute unsafe"?
Taking rods as an example, my understanding is that rods might be used to support some massive objects, and if the rods bend under the load then they might release the objects and cause harm. So the rods need to be strong enough to support the objects, and usually rods are sold with strength guarantees to achieve this.
"If it would fail under this specific load, then it is unsafe" is a clear idea of what would constitute unsafe. I don't think we have this clear of an idea for AI. We have some vague ideas of things that would be undesirable, but there tends to be a wide range of potential triggers and a wide range of potential outcomes, which seem more easily handled by some sort of adversarial setup than by writing down a clean logical description. But maybe when you say "clear idea", you don't necessarily mean a clean logical description, and also consider more vague descriptions to be relevant?
And I agree that rods are often simple, and the reason that I chose rods as an example is because people have an intuitive understanding of some of the characteristics you care about. But the same conceptual model applies to cars, where there is tons of specific safety testing with clearly defined standards, despite the fact that their behavior can be very, very complex.
I already addressed cars and you said we should talk about rods. Then I addressed rods and you want to switch back to cars. Can you make up your mind?
I'm not saying that a standard is sufficient for safety, just that it's incoherent to talk about safety if you don't even have a clear idea of what would constitute unsafe.
I can believe it makes it less definitive and less useful, but I don't buy that it makes it "meaningless" and entirely "incoherent". People can in fact recognize some types of unsafety, and adversarially try to trigger unsafety. I would think that the easier it is to turn GPT into some aggressive powerful thing, the more likely ARC would have been to catch it, so ARC's failure to make GPT do dangerous stuff would seem to constitute Bayesian evidence that it is hard to make it do dangerous stuff.
Also, I wasn't talking about cars in particular - every type of engineering, including software engineering, follows this type of procedure for verification and validation, when those are required. And I think metal rods are a better example to think about - we don't know what it is going to be used for when it is made, but whatever application the rod will be used for, it needs to have some clear standards and requirements.
AFAIK rods are a sufficiently simple artifact that almost all of their behavior can be described using very little information, unlike cars and GPTs?
You're making the assumption that the safety methods for cars are appropriate to transfer directly to e.g. LLMs. That's not clearly true to me as there are strong differences in the nature of cars vs the nature of LLMs. For instance the purposes and capacities of cars are known in great detail (driving people from place to place), whereas the purposes of LLMs are not known (we just noticed that they could do a lot of neat things and assumed someone will find a use-case for them) and their capabilities are much broader and less clear.
I would be concerned that your proposed safety method would become very prone to Goodharting.
That seems reasonable to me.
I think there's a distinction between the environment being in ~equilibrium and you wrestling a resource out from the equilibrium, versus you being part of a greater entity which wrestles resources out from the equilibrium and funnels them to your part?
In your framework, self-improving AI is vertically general (since it can do everything necessary for the task of AI R&D)
It might actually not be, it's sort of hard to be vertically general.
An AI needs electricity and hardware. If it gets its electricity from its human creators and needs its human creators to actively choose to maintain its hardware, then those are necessary subtasks in AI R&D which it can't solve itself.
I think it makes sense to distinguish between a self-improving AI which can handle contract negotiations etc. in order to earn an income, buy electricity, and hire people to handle its hardware, vs an AI that must be owned in order to achieve this.
That said, a self-improving AI may still be more vertically general than other things. I think it's sort of a continuum.
Even though this list isn't very long, lacking these abilities greatly decreases the horizontal generality of the AI.
One thing that is special about self-improving AIs is that they are, well, self-improving. So presumably they either increase their horizontal generality, their vertical generality, or their cost-efficiency over time (or more likely, increase a combination of them).
I like to literally imagine a big list of tasks, along the lines of:
- Invent and deploy a new profitable AI system
- Build a skyscraper
- Form a cat, which catches and eats mice, mates and raises kittens
An operationalization of horizontal generality would then be the number of tasks on the list that something can contribute to. For instance restricting ourselves to the first three items, a cat has horizontal generality 1, a calculator has horizontal generality 2, and a superintelligence has generality 3.
Within each task, we can then think of various subtasks that are necessary to complete it, e.g. for building a skyscraper, you need land, permissions, etc., and then you need to dig, set up stuff, pour concrete, etc. (I don't know much about skyscrapers, can you tell? 😅). Each of these subtasks need some physical interventions (which we ignore because this is about intelligence, though they may be relevant for evaluating the generality of robotics rather than of intelligence) and some cognitive processing. The fraction of the required cognitive subtasks that can be performed by an entity within a task is its vertical generality (within that specific task).
This post made me start wondering - have the shard theorists written posts about what they think is the most dangerous realistic alignment failure?
Hmm upon reflection I might just have misremembered how GPTs work.
I agree that it likely confabulates explanations.
How can I be sure of this? Based on how the API works, we know that a chatbot has no short-term memory. The only thing a chatbot remembers is what it wrote. It forgets its thought process immediately.
Isn't this false? I thought the chatbot is deterministic (other than sampling tokens from the final output probabilities) and that every layer of the transformer is autoregressive, so if you add a question to an existing conversation, it will reconstruct its thoughts underlying the initial conversation, and then allow those thoughts to affect its answers to your followup questions.
Channel capacity between goals and outcomes is sort of a counterintuitive measure of optimization power to me, and so if there are any resources on it, I would appreciate them.
I basically agree that LLMs don't seem all that inherently dangerous and am somewhat confused about rationalists' reaction to them. LLMs seem to have some inherent limitations.
That said, I could buy that they could become dangerous/accelerate timelines. To understand my concern, let's consider a key distinction in general intelligence: horizontal generality vs vertical generality.
- By horizontal generality, I mean the ability to contribute to many different tasks. LLMs supersede or augment search engines in being able to funnel information from many different places on the internet right to a person who needs it. Since the internet contains information about many different things, this is often useful.
- By vertical generality, I mean the ability to efficiently complete tasks with minimal outside assistance. LLMs do poorly on this, as they lack agency, actuators, sensors and probably also various other things needed to be vertically general.
(You might think horizontal vs vertical generality is related to breadth vs depth of knowledge, but I don't think it is. The key distinction is that breadth vs depth of knowledge concerns fields of information, whereas horizontal vs vertical generality concerns tasks. Inputs vs outputs. Some tasks may depend on multiple fields of knowledge, e.g. software development depends on programming capabilities and understanding user needs, which means that depth of knowledge doesn't guarantee vertical generality. On the other hand, some fields of knowledge, e.g. math or conflict resolution, may give gains in multiple tasks, which means that horizontal generality doesn't require breadth of knowledge.)
While we have had previous techniques like AlphaStar with powerful vertical generality, they required a lot of data from the domains they functioned in before becoming useful, and they do not readily generalize to other domains.
Meanwhile, LLMs have powerful horizontal generality, and so people are integrating them into all sorts of places. But I can't help but wonder whether the integration of LLMs in various places will develop their vertical generality, partly by giving them access to more data, and partly by incentivizing people to develop programmatic scaffolding around them.
So LLMs getting integrated everywhere may incentivize removing their limitations and speeding up AGI development.
I suspect it to be worth distinguishing cults from delusional ideologies. As far as I can tell, it is common for ideologies to have inelastic, false, poorly founded beliefs; the classical example is belief in the supernatural. I'm not sure where the exact line between cultishness and delusion lies, but I suspect that it's often useful to define cultishness as something like treating opposing ideologies as infohazards. While rationalists are probably guilty of this, the areas where they are guilty of it don't seem to be p(doom) or LLMs, so it might not be informative to focus cultishness accusations there.
Counterpoint: before this happens, the low-hanging fruit will be grabbed by similar strategies that are dumber and have lower aspirations.
I think you might also be discounting what's being selected on. You wrote:
"Whereas I think I could easily do orders of magnitude more in a day."
You can apply orders of magnitude more optimization power to something on some criterion. But evolution's evaluation function is much higher quality than yours: it evaluates the success of a complex organism in a complex environment, which is very expensive to evaluate and is relevant to deep things (such as discovering intelligence). In a day, you are not able to do 75 bits of selection on cognitive architectures being good for producing intelligence.
I agree that this is an important distinction and didn't mean to imply that my selection is on criteria that are as difficult as evolution's.
I'm explicitly not addressing other failure modes in this post.
Yes, I know, I gave the other failure modes as an example. The thing that confuses me is that you are saying that the (IMO) central piece in the AI algorithm doesn't really matter for the purposes of your post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don't know how to code?
Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
It's not meant as an example of deceptive misalignment, it's meant as an example of how alignment failures by-default depend absolutely massively on the way you train your AI. Like if you train your AI in a different way, you get different failures. So it seems like a strange prior to me to assume that you will get the same results wrt deceptive alignment regardless of how you train it.
E.g. if only 1/2^n organisms reach sexual maturity, that's n bits / generation right there.
I think the child mortality rate used to be something like 50%, which is 1 bit/generation.
If you include gametes being selected (e.g. which sperm swim the fastest), that could be a bunch more bits.
I think an ejaculation involves ~100,000,000 sperm, which corresponds to 26.5 bits, assuming the very "best" sperm is selected (which itself seems unlikely; I feel like surely there's a ton of noise).
If every man is either a loser or Genghis Khan, that's another 10-ish bits.
I would be curious to know how you calculated this, but sure.
This gives us 37.5 bits so far. Presumably there's also a bunch of other bits due to e.g. not all pregnancies making it, etc. Let's be conservative and say that we've only got half the bits in this count, so the total would be 75 bits.
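The tally above is simple enough to check mechanically (all inputs are the rough guesses from this thread, not measurements):

```python
import math

child_mortality_bits = math.log2(2)    # ~50% reach maturity -> 1 bit
sperm_bits = math.log2(1e8)            # "best" of ~10^8 sperm -> ~26.6 bits
khan_bits = 10                         # the loser-or-Genghis-Khan guess

subtotal = child_mortality_bits + sperm_bits + khan_bits
print(f"subtotal: {subtotal:.1f} bits/generation")
print(f"doubled, as above: {2 * subtotal:.0f} bits")
```

This lands at roughly 37.6 bits per generation, or ~75 after doubling.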
That's... not very much? Like I think I can easily write 1000 lines of code in a day, where each LOC would probably contain more than 75 bits worth of information. So I could easily 1000x exceed the selection power of evolution, in a single day worth of programming.
In artificial breeding, there can be even more bits / generation.
Somewhat more, but not hugely more I think? And regardless, humanity doesn't originate from artificial breeding.
"That the specific ways people give feedback isn't very relevant." It seems like the core thing determining the failure modes to me. E.g. if you just show the people who give feedback the source code of the program, then the failure mode will be that often the program will just immediately crash or maybe not even compile. Meanwhile if you show people the running program, then that cannot be a failure mode.
If you agree that generally the failure modes are determined by the feedback process, but hold that deceptive misalignment is somehow an exception to this rule, then I don't see the justification for that and would like it addressed explicitly.
Does that answer your question?
Yes, but then I disagree with the assumptions underlying your post, and expect things that are based on your post to be derailed by the errors that have been introduced.
I agree that one could do something similar with other tech than neat biotech, but I don't think this proves that Kudzugoth Alignment is as difficult as general alignment. I think aligning AI to achieve something specific is likely to be a lot easier than aligning AI in general. It's questionable whether the latter is even possible and unclear what it means to achieve it.
Can you list some examples of durable/macroscale effects you have in mind?
I might be wrong but I think evolution only does a smallish number of bits worth of selection per generation? Whereas I think I could easily do orders of magnitude more in a day.
Basically if artificial superintelligence happens before sufficiently advanced synthetic biology, then one way to frame the alignment problem is "how do we make an ASI create a nice kudzugoth instead of a bad kudzugoth?".
Most of the work done by (mechanistic) consequentialism happens in the outer selection process that produced a system, not in the system so selected.
I think "work done" is the wrong metric for many practical purposes (rewards wastefulness) and one should instead focus on "optimization achieved" or something.
Without major breakthroughs (Artificial Superintelligence) there's no meaningful "alignment plan", just a scientific discipline. There's no sense in which you can really "align" an AI system to do this.
Do you expect humanity to bioengineer this before we develop artificial superintelligence? If not, presumably this objection is irrelevant.
So I'd really like to have some idea how to build a machine that teaches a plant to do something like a safe, human-compatible version of this.
🤔 This is actually a path to progress, right? The difficulty in alignment is figuring out what we want precisely enough that we can make an AI do it. It seems like a feasible research project to map this out for kudzugoth.
Seems convincing enough that I'm gonna make a Discord and maybe switch to this as a project. Come join me at Kudzugoth Alignment Center! ... 😅 I might close again quickly if the plan turns out to be fatally flawed, but until then, here we go.
I'm not so much asking the question of what the tasks are, and instead asking what exactly the setup would be.
For example, if I understand the paper Cotra linked to correctly, they directly showed the raters what the model's output was and asked them to rate it. Is this also the feedback mode you are assuming in your post?
For example in order to train an AI to do advanced software development, would you show unspecialized workers in India how the model describes it would edit the code? If not, what feedback signal are you assuming?