Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions.
In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective understanding of the world, including correct anticipation of the likely consequences of actions. The loss L for that part would probably be self-supervised learning, but could also include self-consistency or whatever.
And then I’m interpreting you (maybe not correctly?) as proposing that we should consider things like making the AI have objectively incorrect beliefs about (say) bioweapons, and I feel like that’s fighting against this L in that dicey way.
Whereas your Q-learning example doesn’t have any problem with fighting against a loss function, because Q(S,A) is being consistently and only updated by the reward.
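For concreteness, here's a minimal tabular Q-learning sketch (my generic illustration, not the specific example from the thread): the only quantity that ever drives an update to Q is the reward signal (plus bootstrapped Q-values), so there's no separate "epistemic" loss to fight against.

```python
import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)] -> estimated return
alpha, gamma, eps = 0.1, 0.99, 0.1

def update(state, action, reward, next_state, actions):
    # The only training signal here is `reward` (plus bootstrapped Q-values).
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def choose(state, actions):
    # Epsilon-greedy action selection over the current Q estimates.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```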
The above is inapplicable to LLMs, I think. (And this seems tied IMO to the fact that LLMs can’t do great novel science yet etc.) But it does apply to FixDT.
Specifically, for things like FixDT, if there are multiple fixed points (e.g. I expect to stand up, and then I stand up, and thus the prediction was correct), then whatever process you use to privilege one fixed point over another, you’re not fighting against the above L (i.e., the “epistemic” loss L based on self-supervised learning and/or self-consistency or whatever). L is applying no force either way. It’s a wide-open degree of freedom.
(If your response is “L incentivizes fixed-points that make the world easier to predict”, then I don’t think that’s a correct description of what such a learning algorithm would do.)
So if your feedback proposal exclusively involves a mechanism that privileges one fixed point over another, then I have no complaints, and would describe it as choosing a utility function (preferences not beliefs) within the FixDT framework.
Btw I think we’re in agreement that there should be some mechanism privileging one fixed point over another, instead of ignoring it and just letting the underdetermined system do whatever it does.
> Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem). … Any sufficiently rich hypothesis space has agentic policies, which can't be ruled out by the feedback.
Oh, I want to set that problem aside because I don’t think you need an arbitrarily rich hypothesis space to get ASI. The agency comes from the whole AI system, not just the “epistemic” part, so the “epistemic” part can be selected from a limited model class, as opposed to running arbitrary computations etc. For example, the world model can be “just” a Bayes net, or whatever. We’ve talked about this before.
> Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
I also learned the term observation-utility agents from you :) You don’t think that can solve those problems (in principle)?
I’m probably misunderstanding you here and elsewhere, but enjoying the chat, thanks :)
The OP talks about the fact that evolution produced lots of organisms on Earth, of which humans are just one example, and that if we view the set of all life, arguably more of it consists of bacteria or trees than humans. Then this comment thread has been about the question: so what? Why bring that up? Who cares?
Like, here’s where I think we’re at in the discussion:
Nate or Eliezer: “Evolution made humans, and humans don’t care about inclusive genetic fitness.”
tailcalled: “Ah, but did you know that evolution also made bacteria and trees?”
Nate or Eliezer: “…Huh? What does that have to do with anything?”
If you think that the existence on Earth of lots of bacteria and trees is a point that specifically undermines something that Nate or Eliezer said, then can you explain the details?
Here’s a sensible claim:
CLAIM A: “IF there’s a learning algorithm whose reward function is X, THEN the trained models that it creates will not necessarily explicitly desire X.”
This is obviously true, and every animal including humans serves as an example. For most animals, it’s trivially true, because most animals don’t even know what inclusive genetic fitness is, so obviously they don’t explicitly desire it.
So here’s a stronger claim:
CLAIM B: “CLAIM A is true even if the trained model is sophisticated enough to fully understand what X is, and to fully understand that it was itself created by this learning algorithm.”
This one is true too, and I think humans are the only example we have. I mean, the claim is really obvious if you know how algorithms work etc., but of course some people question it anyway, so it can be nice to have a concrete illustration.
(More discussion here.)
Neither of those claims has anything to do with humans being the “winners” of evolution. I don’t think there’s any real alignment-related claim that does. Although, people say all kinds of things, I suppose. So anyway, if there’s really something substantive that this post is responding to, I suggest you try to dig it out.
I’ve been on twitter since 2013 and have only ever used the OG timeline (a.k.a. chronological, a.k.a. “following”, a.k.a. every tweet from the people you follow and no others). I think there were periods where the OG timeline was (annoyingly) pretty hard to find, and there were periods where you would be (infuriatingly) auto-switched out of the OG timeline every now and then (weekly-ish?) and had to manually switch back. The OG timeline also has long had occasional advertisements of course. And you might be right that (in some periods) the OG timeline also included occasional other tweets that shouldn’t be in the OG timeline but were thrown in. IIRC, I thought of those as being in the same general category as advertisements, but just kinda advertisements for using more twitter. I think there was a “see less often” option for those, and I always selected that, and I think that helped maintain the relative purity of my OG timeline.
FWIW I don’t think “self-models” in the Intuitive Self-Models sense are related to instrumental power-seeking—see §8.2. For example, I think of my toenail as “part of myself”, but I’m happy to clip it. And I understand that if someone “identifies with the universal consciousness”, their residual urges towards status-seeking, avoiding pain, and so on are about the status and pain of their conventional selves, not the status and pain of the universal consciousness. More examples here and here.
Separately, I’m not sure what if anything the Intuitive Self-Models stuff has to do with LLMs in the first place.
But there’s a deeper problem: the instrumental convergence concern is about agents that have preferences about the state of the world in the distant future, not about agents that have preferences about themselves. (Cf. here.) So for example, if an agent wants there to be lots of paperclips in the future, then that’s the starting point, and everything else can be derived from there.
- Q: Does the agent care about protecting “the temporary state of the execution of the model (or models)”?
- A: Yes, if and only if protecting that state is likely to ultimately lead to more paperclips.
- Q: Does the agent care about protecting “the compute resources (CPU/GPU/RAM) allocated to run the model and its collection of support programs”?
- A: Yes, if and only if protecting those resources is likely to ultimately lead to more paperclips.
Etc. See what I mean? That’s instrumental convergence, and self-models have nothing to do with it.
Sorry if I’m misunderstanding.
Thanks for the comment!
> people report advanced meditative states that lose many of the common properties of consciousness, including Free Will, the feeling of having a self (I've experienced that one!) and even the presence of any information content whatsoever, and afaik they tend to be more "impressed", roughly speaking, with consciousness as a result of those experiences, not less.
I think that’s compatible with my models, because those meditators still have a cortex, in which patterns of neurons can be firing or not firing at any particular time. And that’s the core aspect of the “territory” which corresponds to “conscious awareness” in the “map”. No amount of meditation, drugs, etc., can change that.
Attempt to rephrase: the brain has several different intuitive models in different places. These models have different causal profiles, which explains how they can correspond to different introspective reports.
Hmm, I think that’s not really what I would say. I would say that there’s a concept “conscious awareness” (in the map) that corresponds to the fact (in the territory) that different patterns of neurons can be active or inactive in the cortex at different times. And then there are more specific aspects of “conscious awareness”, like “visual awareness”, which corresponds to the fact that the cortex has different parts (motor cortex etc.), and different patterns of neurons can be active or inactive in any given part of the cortex at different times.
…Maybe this next part will help ↓
> the distinction between visually vivid experience and vague intuitions isn't just that we happen to call them by different labels … Claiming to see a visual image is different from claiming to have a vague intuition in all the ways that it's different
The contents of IT are really truly different from the contents of LIP [I didn’t check where the visual information gets to the cortex in blindsight, I’m just guessing LIP for concreteness]. Querying IT is a different operation than querying LIP. IT holds different types of information than LIP does, and does different things with that information, including leading to different visceral reactions, motivations, semantic knowledge, etc., all of which correspond to neuroscientific differences in how IT versus LIP is wired up.
All these differences between IT vs LIP are in the territory, not the map. So I definitely agree that “the distinction [between seeing and vague-sense-of-presence] isn’t just that we happen to call them by different labels”. They’re different like how the concept “hand” is different from the concept “foot”—a distinction on the map downstream of a distinction in the territory.
> Is awareness really a serial processor in any meaningful way if it can contain as much information at once as a visual image seems to contain?
I’m sure you’re aware that people feel like they have a broader continuous awareness of their visual field than they actually do. There are lots of demonstrations of this—e.g. change blindness, the selective attention test, the fact that peripheral vision has terrible resolution and terrible color perception and makes faces look creepy. There’s a refrigerator light illusion thing—if X is in my peripheral vision, then maybe it’s currently active as just a little pointer in a tiny sub-area of my cortex, but as soon as I turn my attention to X it immediately unfolds in full detail across the global workspace.
The cortex has 10 billion neurons which is more than enough to do some things in parallel—e.g. I can have a song stuck in my head in auditory cortex, while tapping my foot with motor cortex, while doing math homework with other parts of the cortex. But there’s also a serial aspect to it—you can’t parse a legal document and try to remember your friend’s name at the exact same moment.
Does that help? Sorry if I’m not responding to what you see as most important, happy to keep going. :)
Thanks for the detailed comment!
> Well, post #2 is about conscious awareness so it gets the closest, but you only really talk about how there is a serial processing stream in the brain whose contents roughly correspond to what we claim is in awareness -- which I'd argue is just the coarse functional behavior, i.e., the macro problem. This doesn't seem very related to the hard meta problem because I can imagine either one of the problems not existing without the other. I.e., I can imagine that (a) people do claim to be conscious but in a very different way, and (b) people don't claim to be conscious, but their high-level functional recollection does match the model you describe in the post. And if that's the case, then by definition they're independent. … if you actually ask camp #2 people, I think they'll tell you that the problem isn't really about the macro functional behavior of awareness
The way intuitive models work (I claim) is that there are concepts, and associations / implications / connotations of those concepts. There’s a core intuitive concept “carrot”, and it has implications about shape, color, taste, botanical origin, etc. And if you specify the shape, color, etc. of a thing, and they’re somewhat different from most normal carrots, then people will feel like there’s a question “but now is it really a carrot?” that goes beyond the complete list of its actual properties. But there isn’t, really. Once you list all the properties, there’s no additional unanswered question. It just feels like there is. This is an aspect of how intuitive models work, but it doesn’t veridically correspond to anything of substance.
The old Yudkowsky post “How An Algorithm Feels From Inside” is a great discussion of this point.
So anyway, if “consciousness” has connotations / implications A,B,C,D,E, etc. (it’s “subjective”, it goes away under general anesthesia, it’s connected to memory, etc.), then people will feel like there’s an additional question “but is it really consciousness”, that still needs to be answered, above and beyond the specific properties A,B,C,D,E.
And likewise, if you ask a person “Can you imagine something that lacks A,B,C,D,E, but still constitutes ‘consciousness’”, then they may well say “yeah I can imagine that”. But we shouldn’t take that report to be particularly meaningful.
(…See also Frankish’s “Quining Diet Qualia” (2012).)
> Copying the above terminology, we could phrase the hard problem of seeing as explaining why people see images, and the hard meta problem of seeing as explaining why people claim to see images.
As in Post 2, there’s an intuitive concept that I’m calling “conscious awareness” that captures the fact that the cortex has different generative models active at different times. Different parts of the cortex wind up building different kinds of models—S1 builds generative models of somatosensory data, M1 builds generative models of motor programs, and so on. But here I want to talk about the areas in the overlap between the “ventral visual stream” and the “global workspace”, which is mainly in and around the inferior temporal gyrus, “IT”.
When we’re paying attention to what we’re looking at, IT would have some generative model active that optimally balances between (1) priors about the visual world, and (2) the visual input right now. Alternatively, if we’re zoning out from what we’re looking at, and instead using visual imagination or visual memory, then (2) is off (i.e., the active IT model can be wildly incompatible with immediate visual input), but (1) is still relevant, and instead there needs to be consistency between IT and episodic memory areas, or various other possibilities.
So anyway,
- In the territory, “Model A is currently active in IT” is a very different situation from “Model B is currently active in the superior temporal gyrus” or whatever.
- Correspondingly, in the map, we wind up with the intuition that “X is in awareness as a vision” is very different from “Y is in awareness as a sound”, and both are very different from “Z is in awareness as a plan”, etc.
You brought up blindsight. That would be where the model “X is in awareness as a vision” seems wrong. That model would entail a specific set of predictions about the state of IT, and it turns out that those predictions are false. However, some other part of awareness is still getting visual information via some other pathway. (Visual information gets into various parts of the cortex via more than one pathway.) So the blindsight patient might describe their experience as “I don’t see anything, but for some reason I feel like there’s motion on the left side”, or whatever. And we can map that utterance into a correct description of what was happening in their cortex.
Separately, as for the hard problem of consciousness, you might be surprised to learn that I actually haven’t thought about it much and still find it kinda confusing. I had written something into an early draft of post 1 but wound up deleting it before publication. Here’s what it said:
Start with an analogy to physics. There’s a Stephen Hawking quote I like:
> “Even if there is only one possible unified theory, it is just a set of rules and equations. What is it that breathes fire into the equations and makes a universe for them to describe? The usual approach of science of constructing a mathematical model cannot answer the questions of why there should be a universe for the model to describe. Why does the universe go to all the bother of existing?”
I could be wrong, but Hawking’s question seems to be pointing at a real mystery. But as Hawking says, there seems to be no possible observation or scientific experiment that would shed light on that mystery. Whatever the true laws of physics are in our universe, every possible experiment would just confirm, yup, those are the true laws of physics. It wouldn’t help us figure out what if anything “breathes fire” into those laws. What would progress on the “breathes fire” question even look like?? (See Tegmark’s Mathematical Universe book for the only serious attempt I know of, which I still find unsatisfying. He basically says that all possible laws of the universe have fire breathed into them. But even if that’s true, I still want to ask … why?)
By analogy, I’m tempted to say that an illusionist account can explain every possible experiment about consciousness, including our belief that consciousness exists at all, and all its properties, and all the philosophy books on it, and so on … and yet I’m tempted to still say that there’s some “breathes fire” / “why is there something rather than nothing” type question left unanswered by the illusionist account. This unanswered question should not be called “the hard problem”, but rather “the impossible problem”, in the sense that, just like Hawking’s question above, there seems to be no possible scientific measurement or introspective experiment that could shed light on it—all possible such data, including the very fact that I’m writing this paragraph, are already screened off by the illusionist framework.
Well, hmm, maybe that’s stupid. I dunno.
Thanks!
> Do you have any thoughts on why then does psychosis typically suddenly 'kick in' in late adolescence / early adulthood?
Yeah as I discussed in Schizophrenia as a deficiency in long-range cortex-to-cortex communication Section 4.1, I blame synaptic pruning, which continues into your 20s.
> and why trauma correlates with it and tends to act as that 'kickstarter'?
No idea. As for “kickstarter”, my first question is: is that actually true? It might be correlation not causation. It’s hard to figure that out experimentally. That said, I have some discussion of how strong emotions in general, and trauma in particular, can lead to hallucinations (e.g. hearing voices) and delusions via a quite different mechanism in [Intuitive self-models] 7. Hearing Voices, and Other Hallucinations. I’ve been thinking of “psychosis via disjointed cognition” (schizophrenia & mania per this post) and “psychosis via strong emotions” (e.g. trauma, see that other post) as pretty different and unrelated, but I guess it’s maybe possible that there’s some synergy where their effects add up such that someone who is just under the threshold for schizophrenic delusions can get put over the top by strong emotions like trauma.
> Also any thoughts about delusions? Like how come schizophrenic people will occasionally not just believe in impossible things but very occasionally even random things like 'I am Jesus Christ' or 'I am Napoleon'?
I talk about that a bit better in the other post:
> In the diagram above, I used “command to move my arm” as an example. By default, when my brainstem notices my arm moving unexpectedly, it fires an orienting / startle reflex—imagine having your arm resting on an armrest, and the armrest suddenly starts moving. Now, when it’s my own motor cortex initiating the arm movement, then that shouldn’t be “unexpected”, and hence shouldn’t lead to a startle. However, if different parts of the cortex are sending output signals independently, each oblivious to what the other parts are doing, then a key prediction signal won’t get sent down into the brainstem, and thus the motion will in fact be “unexpected” from the brainstem’s perspective. The resulting suite of sensations, including the startle, will be pretty different from how self-generated motor actions feel, and so it will be conceptualized differently, perhaps as a “delusion of control”.
>
> That’s just one example. The same idea works equally well if I replace “command to move my arm” with “command to do a certain inner speech act”, in which case the result is an auditory hallucination. Or it could be a “command to visually imagine something”, in which case the result is a visual hallucination. Or it could be some visceromotor signal that causes physiological arousal, perhaps leading to a delusion of reference, and so on.
So, I dunno, imagine that cortex area 1 is a visceromotor area saying “something profoundly important is happening right now!” for some random reason, and independently, cortex area 2 is saying “who am I?”, and independently, cortex area 3 is saying “Napoleon”. All three of these things are happening independently and unrelatedly. But because of cortex area 1, there’s strong physiological arousal that sweeps through the brain and locks in this configuration within the hippocampus as a strong memory that “feels true” going forward.
That’s probably not correct in full detail, but my guess is that it’s something kinda like that.
I’d bet that Noam Brown’s TED AI talk has a lot of overlap with this one that he gave in May. So you don’t have to talk about it second-hand, you can hear it straight from the source. :) In particular, the “100,000×” poker scale-up claim is right near the beginning, around 6 minutes in.
> The goal is to have a system where there are no unlabeled parameters ideally. That would be the world modeling system. It then would build a world model that would have many unlabeled parameters.
Yup, this is what we’re used to today:
- there’s an information repository,
- there’s a learning algorithm that updates the information repository,
- there’s an inference algorithm that queries the information repository,
- both the learning algorithm and the inference algorithm consist of legible code written by humans, with no inscrutable unlabeled parameters,
- the high-dimensional space [or astronomically-large set, if it’s discrete] of all possible configurations of the information repository is likewise defined by legible code written by humans, with no inscrutable unlabeled parameters,
- the only inscrutable unlabeled parameters are in the content of the information repository, after the learning algorithm has been running for a while.
So for example, in LLM pretraining, the learning algorithm is backprop, the inference algorithm is a forward pass, and the information repository is the weights of a transformer-architecture neural net. There’s nothing inscrutable about backprop, nor about a forward pass. We fully understand what those are doing and how. Backprop calculates the gradient, etc.
That’s just one example. There are many other options! The learning algorithm could involve TD learning. The inference algorithm could involve tree search, or MCMC, or whatever. The information repository could involve a learned value function and/or a learned policy and/or a learned Bayes net and/or a learned OpenCog AtomSpace or whatever. But in all cases, those six bullets above are valid.
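To make those six bullets concrete with the simplest possible example, here's a toy linear regressor (my illustration, not a claim about any particular AGI design): the learning and inference algorithms are a few legible lines of human-written code, and the only unlabeled numbers live inside the repository.

```python
import numpy as np

repository = np.zeros(3)   # the "information repository": just an array of parameters

def infer(repository, x):
    # The inference algorithm: legible, human-written code that queries the repository.
    return repository @ x

def learn(repository, x, y, lr=0.01):
    # The learning algorithm: legible, human-written code that updates the repository.
    grad = (infer(repository, x) - y) * x      # gradient of squared error
    return repository - lr * grad

# The space of possible repository configurations (all of R^3) is likewise human-legible.
# The only inscrutable unlabeled parameters are the learned values inside `repository`.
for x, y in [(np.array([1., 0., 2.]), 3.0), (np.array([0., 1., 1.]), 1.5)] * 200:
    repository = learn(repository, x, y)
```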
So anyway, this is already how ML works, and I’m very confident that it will remain true until TAI, for reasons here. And this is a widespread consensus.
> By understanding the world modeler system you can ensure that the world model has certain properties. E.g. there is some property (which I don't know) of how to make the world model not contain dangerous minds.
There’s a very obvious failure mode in which: the world-model models the world, and the planner plans, and the value function calculates values, etc. … and at the end of all that, the AI system as a whole hatches and executes a plan to wipe out humanity. The major unsolved problem is: how do we confidently avoid that?
Then separately, there’s a different, weird, exotic type of failure mode, where, for example, there’s a full-fledged AGI agent, one that can do out-of-the-box foresighted planning etc., but this agent is not working within the designed AGI architecture (where the planner plans etc. as above), but rather the whole agent is hiding entirely within the world-model. I think that, in this kind of system, the risk of this exotic failure mode is very low, and can be straightforwardly mitigated to become even lower still. I wrote about it a long time ago at Thoughts on safety in predictive learning.
I really think we should focus first and foremost on the very obvious failure mode, which again is an unsolved problem that is very likely to manifest, and we should put aside the weird exotic failure mode at least until we’ve solved the big obvious one.
When we put aside the exotic failure mode and focus on the main one, then we’re no longer worried about “the world model contains dangerous minds”, but rather we’re worried about “something(s) in the world model has been flagged as desirable, that shouldn’t have been flagged as desirable”. This is a hard problem not only because of the interpretability issue (I think we agree that the contents of the world-model are inscrutable, and I hope we agree that those inscrutable contents will include both good things and bad things), but also because of concept extrapolation / goal misgeneralization (i.e., the AGI needs to have opinions about plans that bring it somewhere out of distribution). It’s great if you want to think about that problem, but you don’t need to “understand intelligence” for that, you can just assume that the world-model is a Bayes net or whatever, and jump right in! (Maybe start here!)
> To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system behavior, when you actually understand how tree search works. Of course tree search here is only an example.
Right, but the ability to limit the depth of a tree search is basically useless for getting you to safe and beneficial AGI, because you don’t know the depth that allows dangerous plans, nor do you know that dangerous plans won’t actually be simpler (less depth) than intended plans. This is a very general problem. This problem applies equally well to limiting the compute of a black box, limiting the number of steps of MCMC, limiting the amount of (whatever OpenCog AtomSpace does), etc.
[You can also potentially use tree search depth to try to enforce guarantees about myopia, but that doesn’t really work for other reasons.]
> Python code is a discrete structure. You can do proofs on it more easily than for a NN. You could try to apply program transformations on it that preserve functional equality, trying to optimize for some measure of "human understandable structure". There are image classification algorithms iirc that are worse than NN but much more interpretable, and these algorithms would at most be hundreds of lines of code I guess (haven't really looked a lot at them).
“Hundreds of lines” is certainly wrong because you can recognize easily tens of thousands of distinct categories of visual objects. Probably hundreds of thousands.
Proofs sound nice, but what do you think you can realistically prove that will help with Safe and Beneficial AGI? You can’t prove things about what AGI will do in the real world, because the real world will not be encoded in your formal proof system. (pace davidad).
“Applying program transformations that optimize for human understandable structure” sounds nice, but only gets you to “inscrutable” from “even more inscrutable”. The visual world is complex. The algorithm can’t be arbitrarily simple, while still capturing that complexity. Cf. “computational irreducibility”.
> I'm not brainstorming on "how could this system fail". Instead I understand something, and then I just notice without really trying, that now I can do a thing that seems very useful, like making the system not think about human psychology given certain constraints.
What I’m trying to do in this whole comment is point you towards various “no-go theorems” that Eliezer probably figured out in 2006 and put onto Arbital somewhere.
Here’s an analogy. It’s appealing to say: “I don’t understand string theory, but if I did, then I would notice some new obvious way to build a perpetual motion machine.”. But no, you won’t. We can rule out perpetual motion machines from very general principles that don’t rely on how string theory works.
By the same token, it’s appealing to say: “I don’t understand intelligence, but if I did, then I would notice some new obvious way to guarantee that an AGI won’t try to manipulate humans.”. But no, you won’t. There are deep difficulties that we know you’re going to run into, based on very general principles that don’t rely on the data format for the world-model etc.
I suggest thinking harder about the shape of the solution—getting all the way to Safe & Beneficial AGI. I think you’ll come to realize that figuring out the data format for the world-model etc. is not only dangerous (because it’s AGI capabilities research) but doesn’t even help appreciably with safety anyway.
Huh, funny you think that. From my perspective, “modeling how other people model me” is not relevant to this post. I don’t see anywhere that I even mentioned it. It hardly comes up anywhere else in the series either.
> John's post is quite weird, because it only says true things, and implicitly implies a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong.
>
> Example: A neural network implements modular arithmetic with Fourier transforms. If you implement that Fourier algorithm in Python, it's harder to understand for a human than the obvious modular arithmetic implementation in Python.
Again see my comment. If an LLM does Task X with a trillion unlabeled parameters and (some other thing) does the same Task X with “only” a billion unlabeled parameters, then both are inscrutable.
Your example of modular arithmetic is not a central example of what we should expect to happen, because “modular arithmetic in python” has zero unlabeled parameters. Realistically, an AGI won’t be able to accomplish any real-world task at all with zero unlabeled parameters.
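To spell out what "zero unlabeled parameters" means here, a minimal sketch (the learned-weights version is a hypothetical stand-in, with random numbers in place of trained weights):

```python
import numpy as np

# Zero unlabeled parameters: every symbol is human-legible.
def add_mod_97(a: int, b: int) -> int:
    return (a + b) % 97

# With unlabeled parameters: W stands in for whatever a trained network learned
# (random placeholder values here). Even if such a W computed the same function,
# no individual entry of W would be meaningful on its own.
rng = np.random.default_rng(0)
W = rng.normal(size=(97, 194))

def add_mod_97_learned(a: int, b: int) -> int:
    one_hot = np.concatenate([np.eye(97)[a], np.eye(97)[b]])
    return int(np.argmax(W @ one_hot))
```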
I propose that a more realistic example would be “classifying images via a ConvNet with 100,000,000 weights” versus “classifying images via 5,000,000 lines of Python code involving 1,000,000 nonsense variable names”. The latter is obviously less inscrutable on the margin but it’s not a huge difference.
> The goal is to understand how intelligence works. Clearly that would be very useful for alignment?
If “very useful for alignment” means “very useful for doing technical alignment research”, then yes, clearly.
If “very useful for alignment” means “increases our odds of winding up with aligned AGI”, then no, I don’t think it’s true, let alone “clearly” true.
If you don’t understand how something can simultaneously both be very useful for doing technical alignment research and decrease our odds of winding up with aligned AGI, here’s a very simple example. Suppose I posted the source code for misaligned ASI on github tomorrow. “Clearly that would be very useful” for doing technical alignment research, right? Who could disagree with that? It would open up all sorts of research avenues. But also, it would also obviously doom us all.
For more on this topic, see my post “Endgame safety” for AGI.
> E.g. I could theoretically define a general algorithm that identifies the minimum concepts necessary for solving a task, if I know enough about the structure of the system, specifically how concepts are stored. That's of course not perfect, but it would seem that for very many problems it would make the AI unable to think about things like human manipulation, or that it is a constrained AI, even if that knowledge was somewhere in a learned black box world model.
There’s a very basic problem that instrumental convergence is convergent because it’s actually useful. If you look at the world and try to figure out the best way to design a better solar cell, that best way involves manipulating humans (to get more resources to run more experiments etc.).
Humans are part of the environment. If an algorithm can look at a street and learn that there’s such a thing as cars, the very same algorithm will learn that there’s such a thing as humans. And if an algorithm can autonomously figure out how an engine works, the very same algorithm can autonomously figure out human psychology.
You could remove humans from the training data, but that leads to its own problems, and anyway, you don’t need to “understand intelligence” to recognize that as a possibility (e.g. here’s a link to some prior discussion of that).
Or you could try to “find” humans and human manipulation in the world-model, but then we have interpretability challenges.
Or you could assume that “humans” were manually put into the world-model as a separate module, but then we have the problem that world-models need to be learned from unlabeled data for practical reasons, and humans could also show up in the other modules.
Anyway, it’s fine to brainstorm on things like this, but I claim that you can do that brainstorming perfectly well by assuming that the world model is a Bayes net (or use OpenCog AtomSpace, or Soar, or whatever), or even just talk about it generically.
> If your system is some plain code with for loops, just reduce the number of iterations the for loops of search processes do. Now decreasing/increasing the iterations somewhat will correspond to making the system dumber/smarter. Again obviously not solving the problem completely, but clearly a powerful thing to be able to do.
I’m 100% confident that, whatever AGI winds up looking like, “we could just make it dumber” will be on the table as an option. We can give it less time to find a solution to a problem, and then the solution it finds (if any) will be worse. We can give it less information to go on. Etc.
You don’t have to “understand intelligence” to recognize that we’ll have options like that. It’s obvious. That fact doesn’t come up very often in conversation because it’s not all that useful for getting to Safe and Beneficial AGI.
Again, if you assume the world model is a Bayes net (or use OpenCog AtomSpace, or Soar), I think you can do all the alignment thinking and brainstorming that you want to do, without doing new capabilities research. And I think you’d be more likely (well, less unlikely) to succeed anyway.
This post is about science. How can we think about psychology and neuroscience in a clear and correct way? “What’s really going on” in the brain and mind?
By contrast, nothing in this post (or the rest of this series), is practical advice about how to be mentally healthy, or how to carry on a conversation, etc. (Related: §1.3.3.)
Does that help? Sorry if that was unclear.
See Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc, including my comment on it. If your approach would lead to a world-model that is an uninterpretable inscrutable mess, and LLM research would lead to a world-model that is an even more uninterpretable, even more inscrutable mess, then I don’t think this is a reason to push forward on your approach, without a good alignment plan.
Yes, it’s a pro tanto reason to prefer your approach, other things equal. But it’s a very minor reason. And other things are not equal. On the contrary, there are a bunch of important considerations plausibly pushing in the opposite direction:
- Maybe LLMs will plateau anyway, so the comparison between inscrutable versus even-more-inscrutable is a moot point. And then you’re just doing AGI capabilities research for no safety benefit at all. (See “Endgame safety” for AGI.)
- LLMs at least arguably have some safety benefits related to reliance on human knowledge, human concepts, and chains-of-thought, whereas the kind of AGI you’re trying to invent might not have those.
- Your approach would (if “successful”) be much, much more compute-efficient—probably by orders of magnitude—see Section 3 here for a detailed explanation of why. This is bad because, if AGI is very compute-efficient, then when we have AGI at all, we will have AGI that a great many actors around the world will be able to program and run, and that makes governance very much harder. (Related: I for one think AGI is possible on a single consumer GPU, see here.)
- Likewise, your approach would (if “successful”) have a “better” inductive bias, “better” sample efficiency, etc., because you’re constraining the search space. That suggests fast takeoff and less likelihood of a long duration of janky mediocre-human-level AGIs. I think most people would see that as net bad for safety.
> In any case, it seems that this is a problem that any possible way to build an intelligence runs into? So I don't think it is a case against the project.
If it’s a problem for any possible approach to building AGI, then it’s an argument against pursuing any kind of AGI capabilities research! Yes! It means we should focus first on solving that problem, and only do AGI capabilities research when and if we succeed. And that’s what I believe. Right?
> It seems plausible that one could, simply by understanding the system very well, make it such that the learned data structures need to take particular shapes, such that these shapes correspond to some relevant alignment properties.
I don’t think this is plausible. I think alignment properties are pretty unrelated to the low-level structure out of which a world-model is built. For example, the difference between “advising a human” versus “manipulating a human”, and the difference between “finding a great out-of-the-box solution” versus “reward hacking”, are both extremely important for alignment. But you won’t get insight into those distinctions, or how to ensure them in an AGI, by thinking about whether world-model stuff is stored as connections on graphs versus induction heads or whatever.
Anyway, if your suggestion is true, I claim you can (and should) figure that out without doing AGI capabilities research. Here’s an example. Assume that the learned data structure is a Bayes net, or some generalization of a Bayes net, or the OpenCog “AtomSpace”, or whatever. OK, now spend as long as you like thinking about what if anything that has to do with “alignment properties”. My guess is “very little”. Or if you come up with anything, you can share it. That’s not advancing capabilities, because people already know that there is such a thing as Bayes nets / OpenCog / whatever.
Alternatively, another concrete thing that you can chew on is: brain-like AGI. :) We already know a lot about how it works without needing to do any new capabilities research. For example, you might start with Plan for mediocre alignment of brain-like [model-based RL] AGI and think about how to make that approach better / less bad.
I think Seth is distinguishing “aligning LLM agents” from “aligning LLMs”, and complaining that there’s insufficient work on the former, compared to the latter? I could be wrong.
> I don't actually know what it means to work on LLM alignment over aligning other systems
Ooh, I can speak to this. I’m mostly focused on technical alignment for actor-critic model-based RL systems (a big category including MuZero and [I argue] human brains). And FWIW my experience is: there are tons of papers & posts on alignment that assume LLMs, and with rare exceptions I find them useless for the non-LLM algorithms that I’m thinking about.
As a typical example, I didn’t get anything useful out of Alignment Implications of LLM Successes: a Debate in One Act—it’s addressing a debate that I see as inapplicable to the types of AI algorithms that I’m thinking about. Ditto for the debate on chain-of-thought accuracy vs steganography and a zillion other things.
When we get outside technical alignment to things like “AI control”, governance, takeoff speed, timelines, etc., I find that the assumption of LLMs is likewise pervasive, load-bearing, and often unnoticed.
I complain about this from time to time, for example Section 4.2 here, and also briefly here (the bullets near the bottom after “Yeah some examples would be:”).
I didn’t read it very carefully but how would you respond to the dilemma:
- If the programmer has to write things like “tires are black” into the source code, then it’s totally impractical. (…pace davidad & Doug Lenat.)
- If the programmer doesn’t have to write things like “tires are black” into the source code, then presumably a learning algorithm is figuring out things like “tires are black” from unlabeled data. And then you’re going to wind up with some giant data structure full of things like “ENTITY 92852384 implies ENTITY 8593483 with probability 0.36”. And then we have an alignment problem because the AI’s goals will be defined in terms of these unlabeled entities which are hard to interpret, and where it’s hard to guess how they’ll generalize after reflection, distributional shifts, etc.
I’m guessing you’re in the second bullet but I’m not sure how you’re thinking about this alignment concern.
Yeah, I think the §3.3.1 pattern (intrinsic surprisingness) is narrower than the §3.3.4 pattern (intrinsic surprisingness but with an ability to make medium-term predictions).
But they tend to go together so much in practice (life experience) that when we see the former we generally kinda assume the latter. An exception might be, umm, a person spasming, or having a seizure? Or a drunkard wandering about randomly? Hmm, maybe those don’t count because there are still some desires, e.g. the drunkard wants to remain standing.
I agree that agency / life-force has a strong connotation of the §3.3.4 thing, not just the §3.3.1 thing. Or at least, it seems to have that connotation in my own intuitions. ¯\_(ツ)_/¯
Hmm, I still might not be following, but I’ll write something anyway. :)
Take some “concept” in your world-model, operationalized as a particular cluster C of neurons in some part of your cortex that tend to activate together.
How might we figure out what C “means”?
One part of the answer is entirely within the cortex world-model: C has particular relationships to other things in the cortex world-model, which in turn have relationships to still other things etc. Clusters of neurons related to “bird” have some connection to clusters of neurons related to “flying”. That by itself might already be enough to pin down the “meanings” of different things, just because there’s so much structure there, and we can try to match it up with structures in the world, by analogy with unsupervised machine translation. But if not…
The other part of the answer is about how the cortex world-model relates to the real world. Maybe C directly predicts some particular pattern in low-level sensory inputs. Maybe C directly activates some particular pattern in motor output. Or maybe the connection is less direct—a certain abstract pattern in the space of abstract patterns in the space of abstract patterns in the space of low-level sensory inputs, or whatever. If we look at naturalistic visual inputs that directly or indirectly trigger C, and they’re disproportionately pictures of clocks, then that’s some evidence that C “means” clock.
So, how about “cold”? Our body has a couple relevant sensors: peripheral nerves that express TRPM8 (“cold and menthol receptor 1”), hypothalamus neurons that detect blood temperature via TRPV1, etc. (I’m not an expert on the details.) As usual, these sensory signals are processed in two areas in parallel. In the hypothalamus & brainstem (“Steering Subsystem”), they trigger innate reactions like shivering, unpleasant feelings / desire to warm up, and so on. And in the cortex, they’re treated as just so many more channels of unlabeled input data that the world-model needs to predict.
In the course of predicting them well, the world-model invents some slightly-higher-level concept (or family of closely-interlinked concepts) that we call “cold”. And it notices and memorizes predictively-useful relationships between this new “cold” concept and other things in the world-model, e.g. shivering and ice.
I don’t think there’s more to the concept “cold” than the sum total of its associations with every other concept, with sensory input, and with motor output. And we can explain those latter associations via the structure of the world and body in conjunction with a learning algorithm running throughout your life experience.
> You can sorta write code for a relevant part of what's happening in the mind when e.g. the freezing emotion/sensation is triggered.
I like to draw the distinction between understanding learning algorithms and understanding trained models. The former is kinda like what you learn in an ML course (gradient descent, training data, etc.); the latter is kinda like what you learn in a mechanistic interpretability paper. I don’t think it’s realistic to “write code” for the “cold” concept, because I think it (like all concepts) emerges at the trained model level. It emerges from a learning algorithm, training environment, loss function, etc.
Of course, we can chat about the trained model level to some extent. Why is “cold” associated with shivering? Because in the training environment of life experience, those two things have tended to go together, such that each provides nonzero Bayesian evidence that the other should be active, or will be soon. Ditto with the connection between cold and ice cream, and everything else. So we can chat about it, but it would take forever to directly write code for all those things. Hence the learning algorithm. Does that help?
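Here's a toy sketch of that distinction (a simple co-occurrence counter as my illustrative stand-in for a learning algorithm, not a brain model): the learning rule is a few legible lines, while the "cold"/"shivering" association only exists afterward, inside the trained counts.

```python
from collections import Counter
from itertools import combinations

pair_counts = Counter()
concept_counts = Counter()

def observe(active_concepts):
    # Learning algorithm: count which concepts co-occur in experience.
    for c in active_concepts:
        concept_counts[c] += 1
    for a, b in combinations(sorted(active_concepts), 2):
        pair_counts[(a, b)] += 1

def association(a, b):
    # Trained-model query: how much evidence does one concept give for the other?
    key = tuple(sorted((a, b)))
    return pair_counts[key] / max(1, min(concept_counts[a], concept_counts[b]))

for episode in [{"cold", "shivering"}, {"cold", "ice"}, {"cold", "shivering", "ice"},
                {"warm", "sweating"}]:
    observe(episode)

print(association("cold", "shivering"))   # high: they co-occurred in "training"
print(association("cold", "sweating"))    # zero: they never co-occurred
```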
I disagree with “He seems to have no inside information.” He presented himself as having no inside information, but that’s presumably how he would have presented himself regardless of whether he had inside information or not. It’s not like he needed to convince others that he knows what he’s doing, like how in the stock market you want to buy then pump then sell. This is different—it’s a market that’s about to resolve. The smart play from his perspective would be to aggressively trash-talk his own competence, to lower the price in case he wants to buy more.
Possibly related: Could we use current AI methods to understand dolphins? + comments
Hmm, maybe we should distinguish two things:
- (A) I find the feeling of picking up the tofu with the fork to be intrinsically satisfying—it feels satisfying and empowering to feel the tines of the fork slide into the tofu.
- (B) I don’t care at all about the feeling of the fork sliding into the tofu; instead I feel motivated to pick up tofu with the fork because I’m hungry and tofu is yummy.
For (A), the analogy to picking up feta is logically sound—this is legitimate evidence that picking up the feta will also feel intrinsically satisfying. And accordingly, my brain, having made the analogy, correctly feels motivated to pick up feta.
For (B), the analogy to picking up feta is irrelevant. The dimension along which I’m analogizing (how the fork slides in) is unrelated to the dimension which constitutes the source of my motivation (tofu being yummy). And accordingly, if I like the taste of tofu but dislike feta, then I will not feel motivated to pick up the feta, not even a little bit, let alone to the point where it’s determining my behavior.
The lesson here (I claim) is that our brain algorithms are sophisticated enough to not just note whether an analogy target has good or bad vibes, but rather whether the analogy target has good or bad vibes for reasons that legitimately transfer back to the real plan under consideration.
So circling back to empathy, if I was a sociopath, then “Ahmed getting punched” might still kinda remind me of “me getting punched”, but the reason I dislike “me getting punched” is because it’s painful, whereas “Ahmed getting punched” is not painful. So even if “me getting punched” momentarily popped into my sociopathic head, I would then immediately say to myself “ah, but that’s not something I need to worry about here”, and whistle a tune and carry on with my day.
Remember, empathy is a major force. People submit to torture and turn their lives upside down over feelings of empathy. If you want to talk about phenomena like “something unpleasant popped into my head momentarily, even if it doesn’t really have anything to do with this situation”, then OK maybe that kind of thing might have a nonzero impact on motivation, but even if it does, it’s gonna be tiny. It’s definitely not up to the task of explaining such a central part of human behavior, right?
How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?
[I learned the term teleosemantics from you! :) ]
The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.
> LI defines a notion of logically uncertain variable, which can be used to represent desires
I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.
And then you can put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.
The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.
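A toy sketch of that underdetermination (my illustration, not the LI formalism itself): a self-fulfilling predictor whose only update signal is prediction error converges to whichever fixed point it starts near, and the epistemic update applies no force favoring either one.

```python
def run(expectation, steps=10, lr=0.5):
    for _ in range(steps):
        action = 1 if expectation > 0.5 else 0     # "make the expectation happen"
        error = action - expectation               # self-supervised / epistemic signal
        expectation += lr * error                  # update toward what actually happened
    return expectation

print(run(expectation=0.9))   # converges to 1.0: "I expect to stand up" -> I stand up
print(run(expectation=0.1))   # converges to 0.0: "I expect to stay seated" -> I stay seated
```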
(I don’t think I’m saying anything you don’t already know well.)
Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.
(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)
> beliefs can have impacts on the world if the world looks at them
…Indeed, what I said above is just a special case. Here’s something more general and elegant. You have the core LI system, and then some watcher system W, which reads off some vector of internal variables V of the core LI system, and then W takes actions according to some function A(V).
After a while, the LI system will automatically catch onto what W is doing, and “learn” to interpret V as an expectation that A(V) is going to happen.
I think the central case is that W is part of the larger AI system, as above, leading to normal agent-like behavior (assuming some sensible system for resolving the underdetermination). But in theory W could also be humans peeking into the LI system and taking actions based on what they see. Fundamentally, these aren’t that different.
So whatever solution we come up with to resolve the underdetermination, whether human-brain-like “valence” or something else, that solution ought to work for the humans-peeking-into-the-LI situation just as it works for the normal W-is-part-of-the-larger-AI situation.
(But maybe weird things would happen before convergence. And also, if you don’t have any system at all to resolve the underdetermination, then probably the results would be weird and hard to reason about.)
> Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
I’m not sure that this is coming from a coherent threat model (or else I don’t follow).
- If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
- If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.
Thanks!
> But I've heard that many people do a lot of thinking about negative outcomes, too. …
FWIW my answer is “involuntary attention” as discussed in Valence §3.3.5 (it also came up in §6.5.2.1 of this series).
If I look at my shoe and (voluntarily) pay attention to it, my subsequent thoughts are constrained to be somehow “about” my shoe. This constraint isn’t fully constraining—I might be putting my shoe into different contexts, or thinking about my shoe while humming a song to myself, etc.
By analogy, if I’m anxious, then my subsequent thoughts are (involuntarily) constrained to be somehow “about” the interoceptive feeling of anxiety. Again, this constraint isn’t fully constraining—I might be putting the feeling of anxiety into the context of how everyone hates me, or into the context of how my health is going downhill, or whatever else, and I could be doing both those things while simultaneously zipping up my coat and humming a song, etc.
Anxiety is just one example; I think there’s likewise involuntary attention associated with feeling itchy, feeling in pain, angry, etc.
No I don’t recommend reading this post anymore, it has some ideas with little kernels of truth but also lots of errors and confusions. ¯\_(ツ)_/¯
This is a confusing post from my perspective, because I think of LI as being about beliefs and corrigibility being about desires.
If I want my AGI to believe that the sky is green, I guess it’s good if it’s possible to do that. But it’s kinda weird, and not a central example of corrigibility.
Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? If so, well, I’m generally very skeptical of attempts to do that kind of thing. See here, especially Section 7. In the case of humans, it’s perfectly possible for a plan to seem desirable but not plausible, or for a plan to seem plausible but not desirable. I think there are very good reasons that our brains are set up that way.
My 9yo has recently enjoyed Ender’s Game, Harry Potter, Hitchhiker’s Guide to the Galaxy, and What If. He recently asked to borrow my The Vital Question (it came up in conversation about abiogenesis) and he’s mostly following it so far but has occasional questions for me, we’ll see how far he gets or if he loses steam.
For non-books, he wanted to do Khan academy cosmology / astronomy, I think he did one big unit of Khan academy math before losing interest, he likes Eureka crates (little kits to build your own soap dispenser, rivet press, ukulele, whatever, they come once a month, good gift), lotsa video games, and he was doing DuoLingo Spanish every night (he has a streak, he’s a total sucker for gamification) but to my dismay decided to switch to the rather less practical DuoLingo Klingon. ¯\_(ツ)_/¯
Hmm. I don’t think I’m invoking any mysterious answers. I think I’m suggesting a nuts-and-bolts model—a particular prediction about the behavior of a particular kind of algorithm given a particular type of input data. I’m trying to figure out why you disagree.
Like IMO it's important to recognize that saying "inherent-surprisingness/vitalistic-force my mind paints on objects explains my sense of animals having life-force" is not actually a mechanistic hypothesis -- I would not advance-predict a sense of life-force from thinking that minds project their continuous surprise about an object as a property on the object itself. Not sure whether you're making this mistake though.
Again I think it’s a mechanistic hypothesis. Let me walk through it in more detail; see where you disagree:
- Any concept or property in your conscious experience is a piece (latent variable or whatever) in a generative model built by a predictive (self-supervised) learning algorithm on sensory data.
- Some of that sensory data is interoceptive, including things like sense of one’s own physiological arousal, temperature, confusion, valence (goodness / badness), physical attraction, etc.
- The “mind projection fallacy” applies to these interoceptive sensations (§3.3.2). Why? Because the learning algorithm is finding generative models that predict sensory data, and mind-projection-fallacy generative models are simple and effective at predicting interoceptive sensory data. For example, whenever I look at the shirt, I reliably get white-derived visual sensations, therefore I wind up with a generative model that says that there’s a shirt in the world, and it’s white. Likewise, whenever I think about capitalism, I reliably get an interoceptive sensation of negative valence, therefore I wind up with a generative model that says that there’s a thing “capitalism” in the world, and that thing is “bad”.
- Every interoceptive sensation spawns a mind-projection-fallacy conscious concept / property that applies to things in the outside world. And surprise is one such sensation. So a priori we strongly expect every adult human to feel like there’s a surprise-derived intuitive property of things in the world. (But I haven’t yet said which intuitive property it is.)
- Meanwhile, in our everyday experience, we all have an intuitive sense of animation / agency. I think the word “vitalistic force” is a good way to point to this recognizable intuition.
- And then my substantive claim is that the previous two bullets should be equated: the surprise-derived intuitive property in adult humans is the intuitive sense of animation / agency.
Alternatively, suppose we didn’t have our subjective experience, but were told that there exist predictive learning algorithms blah blah as in Post 1. We should predict that these algorithms will build generative models containing a surprise-derived property of things in the world. And then we could look around the “training environment” (human world), try to figure out what would generate surprise (things that are both unpredictable and un-ignorable), and we’d predict that this intuitive property would get painted first and foremost onto things that are alive, but also onto cartoon characters and so on, and also onto certain self-reflective things (i.e., aspects of the brain algorithm itself). When we do this kind of analysis well, we’ll wind up describing every aspect of our actual everyday intuitions around animation / agency / alive-ness, and predicting all the items in §3.3. But we’d be doing all that purely from first-principles reasoning about algorithms and biology. And then that “prediction” would be “tested” by noticing that humans have exactly those intuitions. As it happens, it’s not really a “prediction”, because we already know what intuitions are typical in human adults. But nevertheless I think the reasoning is sound and tight and locally-valid, not just special pleading because we already know the answer. See what I mean?
I think when humans model other minds (which includes animals (and gods)) they start from a pre-built template (potentially from mirroring part of their own cognitive machinery) with the properties of goals/desires, emotions, memory, and beliefs.
I think that when an average person sees a cockroach running across the floor, they think of it as having goals but probably not emotions or memories or beliefs. As a scientific matter, cockroaches do have memories, but I think at least some people feel kinda surprised and impressed when they see a cockroach doing something that demonstrates memory, which suggests that their intuitive model did not already include cockroach memory. But everyone thinks of the cockroach as being alive / animate, and also, nobody would be surprised or impressed to see a cockroach demonstrate “wanting” / goal-seeking by going around a trivial barrier to get into a hiding place.
That goes well with my theory that “vitalistic force” (derived from surprise) and “wanting” (derived from a pattern where I can make medium-term predictions despite short-term surprise) are two widely-used core intuitions in our generative model space, which strongly tend to go together. And then other aspects of modeling minds are optional add-ons. (Just like “has frost on it” is an optional add-on to an object being “cold”.)
Also, I think you're aware of this, but nothing is inherently meaningful; meaning can only arise through how something relates to something else. In the cold case (where I assume you're talking about mental-physiological reactions to freezing/feeling-cold (as opposed to modelling the temperature of objects)), the meaning of "cold" comes from the cluster of sensations it refers to and how it affects considerations. If you just had the information "type-ABC (aka 'cold') sensors fired at position-XYZ", the rest of the mind wouldn't know what to do with that information on its own; it needs some circuitry to relate the information to other events. So I wouldn't say what you wrote explains cold, but maybe you didn't think it did.
My claim is: there’s a predictive learning algorithm that sculpts generative models that can explain incoming sensory data. (See Post 1.) When I look at a clock, the sensory data involves retinal cells firing, while the generative model involves the concept “clock” (among other things).
The concept “cold”, like “clock”, is a concept in our intuitive models. This is “meaningful” in the same way any other intuitive concept is meaningful. It fits into our web-of-knowledge / world-model / “map” / generative model space, it has relations to other concepts, it helps make sense of the world, etc.
If an adult has a concept in their intuitive models, then that concept must be doing some work: it must be directly or indirectly helping to predict some kind of sensory input data. Otherwise it would not be in the generative models in the first place—that’s how the predictive learning algorithm works. For example, the concept “clock” is doing lots of work in different contexts, including helping explain visual input data when I happen to be looking at a clock. Thus we can ask by analogy: what’s the concept “cold” doing? The obvious answer is: the concept “cold” is mainly helping explain sensory input data involving the signals coming from blah blah type of thermoreceptor in the peripheral nervous system.
The point I was making before was that the concept “cold” starts from that important role. But by adulthood it winds up being invoked by analogy in things like “cold comfort”, and getting all these other connotations that are not superficially related to predicting the sensory signals coming from blah blah type of thermoreceptor. …But nevertheless, I think it’s fair to say that the central role of the “cold” concept, even in adults, is to enable generative models to correctly predict (many of) the signals coming from blah blah type of thermoreceptor.
And in a similar way, I’m claiming that the central role of the intuitive “vitalistic force” / “animation” concept is to enable generative models to correctly predict many of the sensory signals coming from the interoceptive sensation of surprise. (But it’s still true that this concept winds up with other connotations and extensions-by-analogy too.)
Does that help? Thanks for patient engagement and feedback.
Here are all of my interactions with claude related to writing blog posts or comments in the last four days:
- I asked Claude for a couple back-of-the-envelope power output estimations (running, and scratching one’s nose). I double-checked the results for myself before alluding to them in the (upcoming) post. Claude’s suggestions were generally in the right ballpark, but more importantly Claude helpfully reminded me that metabolic power consumption = mechanical power + heat production, and that I should be clear on which one I mean.
- “There are two unrelated senses of "energy conservation", one being physics, the other being "I want to conserve my energy for later". Is there some different term I can use for the latter?” — Claude had a couple good suggestions; I think I wound up going with “energy preservation”.
- “how many centimeters separate the preoptic nucleus of the hypothalamus from the arcuate nucleus?” — Claude didn’t really know but its ballpark number was consistent with what I would have guessed. I think I also googled, and then just to be safe I worded the claim in a pretty vague way. It didn’t really matter much for my larger point in even that one sentence, let alone for the important points in the whole (upcoming) post.
- “what's a typical amount that a 4yo can pick up? what about a national champion weightlifter? I'm interested in the ratio.” — Claude gave an answer and showed its work. Seemed plausible. I was writing this comment, and after reading Claude’s guess I changed a number from “500” to “50”.
- “Are there characteristic auditory properties that distinguish the sound of someone talking to me while facing me, versus talking to me while facing a different direction?” — Claude said some things that were marginally helpful. I didn’t wind up saying anything about that in the (upcoming) post.
- “what does "receiving eye contact" mean?” — I was trying to figure out if readers would understand what I mean if I wrote that in my (upcoming) post. I thought it was a standard term but had a niggling worry that I had made it up. Claude got the right answer, so I felt marginally more comfortable using that phrase without defining it.
- “what's the name for the psychotic delusion where you're surprised by motor actions?” — I had a particular thing in mind, but was blanking on the exact word. Claude was pretty confused but after a couple tries it mentioned “delusion of control”, which is what I wanted. (I googled that term afterwards.)
When we develop mechanisms to control AI systems, we are essentially creating tools that could be used by any sufficiently powerful entity - whether that's a government, corporation, or other organization. The very features that make an AI system "safe" in terms of human control could make it a more effective instrument of power consolidation.
…And if we fail to develop such mechanisms, AI systems will still be an “instrument of power consolidation”, but the power being consolidated will be the AI’s own power, right?
I mean, 90% of this article—the discussion of offense-defense balance, and limits on human power and coordination—applies equally to “humans using AI to get power” versus “AI getting power for its own purposes”, right?
E.g. out-of-control misaligned AI is still an “enabler of coherent entities”, because it can coordinate with copies of itself.
I guess you’re not explicitly arguing against “open publication of safety advances” but just raising a point of consideration? Anyway, a more balanced discussion of the pros and cons of “open publication of safety advances” would include:
- Is “humans using AI to get power” less bad versus more bad than “AI getting power for its own purposes”? (I lean towards “probably less bad but it sure depends on the humans and the AI”)
- If AI obedience is an unsolved technical problem to such-and-such degree, to what extent does that lead to people not developing ever-more-powerful AI anyway? (I lean towards “not much”, cf. Meta / LeCun today, or the entire history of AI)
- Is the sentence “in reality we should expect combined human-AI entities to reach dangerous capabilities before pure artificial intelligence” really true, and if so how much earlier and does it matter? (I lean towards “not necessarily true in the first place, and if true, probably not by much, and it’s not all that important”)
It’s probably a question that needs to be considered on a case-by-case basis anyway. ¯\_(ツ)_/¯
Against hard barriers of this kind, you can point to arguments like “positing hard barriers of this kind requires saying that there are some very small differences in intelligence that make the crucial difference between being able vs. unable to do the task in principle. Otherwise, e.g., if a sufficient number of IQ 100 agents with sufficient time can do anything that an IQ 101 agent can do, and a sufficient number of IQ 101 agents with sufficient time can do anything an IQ 102 agent can do, etc, then by transitivity you end up saying that a sufficient number of IQ 100 agents with sufficient time can do anything an IQ 1000 agent can do. So to block this sort of transition, there needs to be at least one specific point where the relevant transition gets blocked, such that e.g. there is something that an IQ X agent can do that no number of IQ X-minus-epsilon agents can do. And can epsilon really make that much of a difference?”
Here’s an analogy, maybe. A sufficient number of 4yo’s could pick up any weight that a 5yo could pick up; a sufficient number of 5yo’s could pick up any weight that a 6yo could pick up … a sufficient number of national-champion weightlifters could pick up any weight that a world-record weightlifter could pick up.
So does it follow that a sufficient number of 4yo’s can pick up any weight that a world-record weightlifter could pick up? No! The problem is, the weight isn’t very big. So you can’t get a group of 50 4yo’s to simultaneously contribute to picking it up. There’s just no room for them to all hold onto it.
So here’s a model. There are nonzero returns to more agents working together to do a task, if they can all be usefully employed. But there are also rapidly-increasing coordination costs, and/or limitations to one’s ability to split a task into subtasks.
In the human world, you can’t notice a connection between two aspects of a problem unless those two aspects are simultaneously in a single person’s head. Thus, for hard problems, you can split them up a bit, with skill and luck, but not too much, and it generally requires that the people working on the subproblems have heavily-overlapping understandings of what’s going on (or that the manager who split up the problem in the first place has a really solid understanding of both subproblems such that they can be confident that it’s a clean split). See also: interfaces as scarce resources.
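Here’s a toy numerical illustration of that model (the functional forms and numbers are totally made up, just to show the qualitative shape):

```python
def effective_output(n_agents, max_useful=8, per_agent=1.0, coord_cost=0.02):
    """Total output when only `max_useful` agents can be usefully employed and
    coordination overhead grows roughly quadratically with team size."""
    useful = min(n_agents, max_useful)        # limited ability to split the task
    overhead = coord_cost * n_agents ** 2     # rapidly-increasing coordination costs
    return max(0.0, useful * per_agent - overhead)

for n in (1, 4, 8, 16, 50):
    print(n, round(effective_output(n), 2))
# Output rises at first, then flattens, then falls: past some point, adding more
# agents (more 4yo's, more researchers) doesn't help and eventually hurts.
```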
Thanks!
I don’t think S(A) or any other thought bursts into consciousness from the void via an acausal act of free will—that was the point of §3.3.6. I also don’t think that people’s self-reports about what was going on in their heads in the immediate past should necessarily be taken at face value—that was the point of §2.3.
Every thought (including S(A)) begins its life as a little seed of activation pattern in some little part of the cortex, which gets gradually stronger and more widespread across the global workspace over the course of a fraction of a second. If that process gets cut off prematurely, then we don’t become aware of that thought at all, although sometimes we can notice its footprints via an appropriate attention-control query.
Does that help?
Maybe you’re thinking that, if I assert that a positive-valence S(A) caused A to happen, then I must believe that there’s nothing upstream that in turn caused S(A) to appear and to have positive valence? If so, that seems pretty silly to me. That would be basically the position that nothing can ever cause anything, right?
(“Your Honor, the victim’s death was not caused by my client shooting him! Rather, The Big Bang is the common cause of both the shooting and the death!” :-D )
At least this has helped clarify that you think of S(A) to (often) precede A by a lot, which wasn't clear to me.
Not really; instead, I think throwing the ball is a time-extended course of action, as most actions are. If I “decide” to say a sentence or sing a song, I don’t separately “decide” to say the next syllable, then “decide” to say the next syllable after that, etc.
What do you make of the Libet experiments?
He did a bunch of experiments, I’m not sure which ones you’re referring to. (The “conscious intentions” one?) The ones I’ve read about seem mildly interesting. I don’t think they contradict anything I wrote or believe. If you do think that, feel free to explain. :)
We’re definitely talking past each other somehow. For example, your statement “The S(A) is precisely timed to coincide with the release” is (to me) obviously false. In the case of “deciding to throw a ball”, A would be the time-extended action of throwing the ball, and S(A) would be me “making a decision of my free will” to throw the ball, which happens way before the release, indeed it happens before I even start moving my arm. Releasing the ball isn’t a separate “decision” but rather part of the already-decided course-of-action.
(Again, I’m definitely not arguing that every action is this kind of stereotypical [S(A); A] “intentional free will decision”, or even that most actions are. Non-examples include every action you take in a flow state, and indeed you could say that every day is full of little “micro-flow-states” that last for even just a few seconds when you’re doing something rather than self-reflecting.)
…Then after the fact, I might recall the fact that I released the ball at such-and-such moment. But that thought is not actually about an “action” for reasons discussed in §2.6.1.
I agree that there is such a thing as two things occurring in sequence where the first doesn’t cause the second. But I don’t think this is one of those cases. Instead, I think there are strong reasons to believe that if S(A) is active and has positive valence, then that causally contributes to A tending to happen afterwards.
For example, if A = stepping into the ice-cold shower, then the object-level idea of A is probably generally negative-valence—it will feel unpleasant. But then S(A) is the self-reflective idea of myself stepping into the shower, and relatedly how stepping into the shower fits into my self-image and the narrative of my life etc., and so S(A) is positive valence.
I won’t necessarily wind up stepping into the shower (maybe I’ll chicken out), but if I do, then the main reason why I do is the fact that the S(A) thought was active in my mind immediately beforehand, and had positive valence. Right?
Hmm. I think you’re understating the tendency of most people to follow prevailing norms, and yet your main conclusion is partly right. I think there are interesting dynamics happening at two levels simultaneously—the level of individual decisions, and the level of cultural evolution—and your comment is kinda conflating those levels.
So here’s how I would put things:
1. Most people care very very strongly about doing things that would look good in the eyes of the people they respect. They don’t think of it that way, though—it doesn’t feel like that’s what they’re doing, and indeed they would be offended by that suggestion. Instead, those things just feel like the right and appropriate things to do. This is related to and upstream of norm-following. This is an innate drive, part of human nature built into our brain by evolution.
2. Most people also have various other innate drives that lead them to feel motivated to eat when hungry, to avoid pain, to bond with friends, for parents to love their children and for adolescents to disrespect their parents (but respect their slightly-older friends), and much else.
3. (But there’s person-to-person variation, and in particular some small fraction of people are sociopaths who just don’t feel intrinsically motivated by (1) at all.)
4. The norms of (1) can be totally arbitrary. If the people I respect think that genocide is bad, then probably so do I. If they think genocide is awesome, then probably so do I. If they think it’s super-cool to hop backwards on one foot, then probably so do I.
5. …But (2) provides a constant force gently pushing norms towards behavioral patterns that match up with the innate tendencies in (2). So we tend to wind up with cultural norms that line up with avoiding pain, eating-when-hungry, bonding with friends, and so on.
6. …But not perfectly, because there are other forces acting on norms too, such as game-theoretic signaling equilibria or whatever. These enable the existence of widespread norms with aspects that run counter to aspects of (2)—think of religious fasting, initiation rites, etc.
7. When (4), (5), and (6) play out in some group or society, some norms will “win” over others, and the norms that “win” are probably (to some extent) a priori predictable from structural aspects of the situation—homogeneity, mobility, technology, whatever.
Seconding quetzal_rainbow’s comment. Another way to put it is:
- If your reference class is “integrating a new technology into the economy”, then you’d expect AI integration to unfold over decades.
- …But if your reference class is “integrating a new immigrant human into the economy—a human who is already generally educated, acculturated, entrepreneurial, etc.”, then you’d expect AI integration to unfold over years, months, even weeks. There’s still on-the-job training and so on, for sure, but we expect the immigrant human to take the initiative to figure out for themselves where the opportunities are and how to exploit them.
We don’t have AI that can do the latter yet, and I for one think that we’re still a paradigm-shift away from it. But I do expect the development of such AI to look like “people find a new type of learning algorithm” as opposed to “many many people find many many new algorithms for different niches”. After all, again, think of humans. Evolution did not design farmer-humans, and separately design truck-driver-humans, and separately design architect-humans, etc. Instead, evolution designed one human brain, and damn, look at all the different things that that one algorithm can figure out how to do (over time and in collaboration with many other instantiations of the same algorithm etc.).
How soon can we expect this new paradigm-shifting type of learning algorithm? I don’t know. But paradigm shifts in AI can be frighteningly fast. Like, go back a mere 12 years ago, and the entirety of deep learning was a backwater. See my tweet here for more fun examples.
Hmm. Maybe here’s an analogy. Suppose somebody said:
There’s a certain kind of interoceptive sensory input, consisting of such-and-such signal coming from blah type of thermoreceptor in the peripheral nervous system. Your brain does its usual thing of transforming that sensation into its own “color” of “metaphysical paint” (as in §3.3.2) that forms a concept / property in your conscious awareness and world-model, and you know it by the everyday term “cold”.
On the one hand, I would defend this passage as basically true. On the other hand, there are clearly a lot of connotations and associations of the word “cold” that go way beyond the natural generalization of things that trigger this thermoreceptor. “Concepts are clusters in thingspace”, as the saying goes, and thus things that go along with coldness often enough kinda get roped in as a connotation or aspect of the coldness concept itself. And then all those aspects of coldness can in turn get analogized into other domains, and now here we are talking about cold personalities and cold starts and cold cases and cold symptoms and the Cold War and on and on.
By the same token, I’m happy to defend a claim along the lines of “intrinsic unpredictability is the seed / core at the center of concepts like animation, vitality, agency, etc.”, but I acknowledge that intrinsic unpredictability in and of itself is not the entirety of those terms and their various connotations and associations.
(This is a helpful discussion for me, thanks.)
The brain needs to observe something (sense, interoception) from which it can infer this. The pattern in what observations would that be?
(partly copying from my other comment) For example, consider the following fact.
FACT: Sometimes, I’m thinking about pencils. Other times, I’m not thinking about pencils.
Now imagine that there’s a predictive (a.k.a. self-supervised) learning algorithm which is tasked with predicting upcoming sensory inputs, by building generative models. The above fact is very important! If the predictive learning algorithm does not somehow incorporate that fact into its generative models, then those generative models will be worse at making predictions. For example, if I’m thinking about pencils, then I’m likelier to talk about pencils, and look at pencils, and grab a pencil, etc., compared to if I’m not thinking about pencils. So the predictive learning algorithm is incentivized (by its predictive loss function) to build a generative model that can represent the fact that any given concept might be active in the cortex at a certain time, or might not be.
See also §1.4.
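Here’s a toy calculation of that incentive (all numbers are made up, purely illustrative): a generative model that tracks whether the “pencil” concept is currently active gets lower expected surprise (predictive loss) than one that only knows the overall base rate of pencil-related observations.

```python
import numpy as np

# Made-up numbers, purely illustrative:
p_thinking = 0.1                          # fraction of the time I'm thinking about pencils
p_pencil_obs = {True: 0.4, False: 0.02}   # P(next observation is pencil-related | thinking?)
p_marginal = p_thinking * 0.4 + (1 - p_thinking) * 0.02

def avg_surprise(predict):
    """Expected negative log-likelihood of the next observation, when the model
    predicts P(pencil-related) via predict(thinking_state)."""
    total = 0.0
    for thinking, p_state in [(True, p_thinking), (False, 1 - p_thinking)]:
        p_true, q = p_pencil_obs[thinking], predict(thinking)
        total += p_state * (-p_true * np.log(q) - (1 - p_true) * np.log(1 - q))
    return total

print(avg_surprise(lambda thinking: p_pencil_obs[thinking]))  # tracks the latent state: ~0.16 nats
print(avg_surprise(lambda thinking: p_marginal))              # ignores it:              ~0.22 nats
```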
That's why I have long thought that there has to be a feedback from the current thought back as input signal (thoughts as observations). Such a connection is not present in the brain-like model, but it might not be the only way. Another way would be via memory.
I mean yeah obviously the cortex has various types of memory, and this fact is important for all kinds of things. :)
Clearly, action A can happen without S(A) being present. In fact, actions are often more effectively executed if you don't think too hard about them[citation needed]. An S(A) is not required. Maybe S(A) and A cooccur often, but that doesn't imply causality.
These sentences seem to suggest that A’s are either always, or never, caused by a preceding S(A), and that out of those two options, “never” is more plausible. But that’s a false dichotomy. I propose that sometimes they are, and sometimes they aren’t, caused by S(A).
By analogy, sometimes doors open because somebody pushed on them, and sometimes doors open without anyone pushing on them. Also, it’s possible for there to be a very windy day where the door would open with 30% probability in the absence of a person pushing on it, but opens with 85% probability if somebody does push on it. In that case, did the person “cause” the door to open? I would say yeah, they “partially caused it” to open, or “causally contributed to” the door opening, or “often cause the door to open”, or something like that. I stand by my claim that the self-reflective S(standing up), if sufficiently motivating, can “cause” me to then stand up, in that sense.
The observation that “actions are often more effectively executed if you don't think too hard about them” refers to the fact that if you have a learned skill, in the form of some optimized context-dependent temporal sequence of motor-control and attention-control commands, then self-reflective thoughts can interrupt and thus mess up that temporal sequence, just as people shouting random numbers can disrupt someone trying to count, or how you can’t sing two songs in your head simultaneously. A.k.a. the limited capacity of cortex processing. Whereas that section is more about whether some course-of-action (like saying something, wiggling your fingers, standing up, etc.) starts or not.
Flow states (post 4) are a great example of A’s happening without any S(A).
One common question is whether the self-modeling task, which involves predicting a layer's own activations, would cause the network to merely learn the identity function. Intuitively, this might seem like an optimal outcome for minimizing the self-modeling loss.
I found this section confusing. If the identity function is the global optimum for self-modeling loss, isn’t it kinda surprising that training doesn’t converge to the identity function? Or does the identity function make it worse at the primary task? If so, why?
[I’m sure this is going to be wrong in some embarrassing way, but what the heck… What I’m imagining right now is as follows. There’s an N×1 activation vector in the second-to-last layer of the DNN, and then a M×N weight matrix constituting the linear transformation, and you multiply them to get a M×1 output layer of the DNN. The first (M–N) entries of that output layer are the “primary task” outputs, and the bottom N entries are the “self-modeling” outputs, which are compared to the earlier N×1 activation vector mentioned above. And when you’re talking about “identity matrix”, you actually mean that the bottom N×N block of the weight matrix is close to an identity matrix but evidently not quite. (Oops I’m leaving out the bias vector, oh well.) If I’m right so far, then it wouldn’t be the case that the identity matrix makes the thing worse at the primary task, because the top (M-N)×N block of the weight matrix can still be anything. Where am I going wrong?]
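In code, here’s a minimal sketch of the setup I’m imagining (again, this is my guess at the architecture, not necessarily what the paper actually does; the names and dimensions are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N = 64          # width of the second-to-last layer (hypothetical)
M = N + 10      # output layer: 10 hypothetical primary-task logits + N self-model outputs

final_linear = nn.Linear(N, M)   # the MxN weight matrix (plus bias vector) I'm describing

def forward_and_losses(penultimate_acts, labels):
    out = final_linear(penultimate_acts)     # shape (batch, M)
    primary_out = out[:, :M - N]             # first M-N entries: primary-task outputs
    self_model_out = out[:, M - N:]          # last N entries: self-modeling guess
    primary_loss = F.cross_entropy(primary_out, labels)
    # compared against the penultimate activations themselves
    # (possibly the target should be detached; I'm not sure what the paper does):
    self_model_loss = F.mse_loss(self_model_out, penultimate_acts)
    return primary_loss, self_model_loss
```

If the bottom N×N block of `final_linear.weight` were exactly the identity (with the matching bias entries zero), then `self_model_loss` would be exactly zero while leaving the top (M-N)×N block completely unconstrained; that’s why I’d expect the identity solution not to hurt the primary task.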
As a toy-model point of comparison, here’s one thing that could hypothetically happen during “self-modeling” of the activations of layer L: (1) the model always guesses that the activations of layer L are all 0; (2) gradient descent sculpts the model to have very small activations in layer L.
In this scenario, it’s not really “self-modeling” at all, but rather a roundabout way to implement “activation regularization” specifically targeted to layer L.
In “activation regularization”, the auxiliary loss term is just $\|a_L\|^2$, whereas in your study it’s $\|a_L - \hat{a}_L\|^2$ (where $a_L$ is the layer L activation vector and $\hat{a}_L$ is the self-modeling guess vector). So activation regularization might be a better point of comparison than the weight regularization that you brought up in the appendix. E.g. activation regularization does have the property that it “adapts based on the structure and distribution of the input data”.
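To spell out that comparison in code (a sketch under my reading, with notation of my own):

```python
import torch

def activation_regularization_loss(a_L):
    # plain activation regularization: push layer-L activations toward zero
    return (a_L ** 2).mean()

def self_modeling_loss(a_L, a_hat_L):
    # the self-modeling auxiliary term: predict layer-L activations
    return ((a_L - a_hat_L) ** 2).mean()
```

In the degenerate scenario above, where the self-model guess $\hat{a}_L$ collapses to roughly zero and gradient descent shrinks $a_L$, the second loss behaves just like the first, which is why activation regularization seems like the natural baseline.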
I’d be curious whether you get similar “network complexity” (SD & RLCT) results with plain old activation regularization. That might be helpful for disentangling the activation regularization from bona fide self-modeling.
(I haven’t really thought through the details. Is there batch norm? If so, how does that interact with what I wrote? Also, in my example at the top, I could have said “the model always guesses that the activations are some fixed vector V” instead of “…that the activations are all 0”. Does that make any difference? I dunno.)
Sorry if this is all stupid, or in the paper somewhere.
Honest question: Suppose that my friends and other people whom I like and respect and trust all believe that genocide is very bad. I find myself (subconsciously) motivated to fit in with them, and I wind up adopting their belief that genocide is very bad. And then I take corresponding actions, by writing letters to politicians urging military intervention in Myanmar.
In your view, would that count as “selfish” because I “selfishly” benefit from ideologically fitting in with my friends and trusted leaders? Or would it count as “altruistic” because I am now moved by the suffering of some ethnic group across the world that I’ve never met and can’t even pronounce?
I’m not an expert and I’m not sure it matters much for your point, but: yes, there were surely important synergies between NASA activities and the military ballistic missile programs in the 1960s, but I don’t think it’s correct to suggest that most NASA activity was stuff that would have had to be done for the ballistic missile program anyway. It might actually be a pretty small fraction. For example, less than half the Apollo budget was for launch vehicles; they spent a similar amount on spacecraft, which are not particularly transferable to nukes. And even for the launch vehicles, it seems that NASA tended to start with existing military rocket designs and modify them, rather than the other way around.
I would guess that the main synergy was more indirect: helping improve the consistency of work, economies of scale, defraying overhead costs, etc., for the personnel and contractors and so on.
Why is, according to your model, the valence of self-reflective thoughts sorta the valence our "best"/pro-social selves would ascribe?
That would be §2.5.1. The idea is that, in general, there are lots of kinds of self-reflective thoughts: thoughts that involve me, and what I’m doing, and what I’m thinking about, and how my day is going, and whether I’m following through with my new years resolution, and what other people would think of me right now, and so on.
These all tend to have salient associations with each other. If I’m thinking about how my day is going, it might remind me that I had promised myself to exercise every day, which might remind me that Sally called me fat, and so on.
Whereas non-self-reflective thoughts by and large have less relation to that whole cloud of associations. If I’m engrossed in a movie and thinking about how the prince is fighting a dragon in a river, or even if I’m just thinking about how best to chop this watermelon, then I’m not thinking about any of those self-reflective things in the above paragraph, and am unlikely to for at least the next second or two.
Incidentally, I think your description is an overstatement. My claim is that “the valence our "best"/pro-social selves would ascribe” is very relevant to the valence of self-reflective thoughts, to a much greater extent than non-self-reflective thoughts. But they’re not decisive. That’s what I was suggesting by my §2.5.2 example of “Screw being ‘my best self’, I’m tired, I’m going to sleep”. The reason that they’re very relevant is those salient associations I just mentioned. If I self-reflect on what I’m thinking about, then that kinda reminds me of how what I’m thinking about reflects on myself in general; so if the latter seems really good and motivating, then some of that goodness will splash onto the former too.
Do you buy that? Sorry if I’m misunderstanding.
Why does the homunculus get modeled as wanting pro-social/best-self stuff (as opposed to just what overall valence would imply)?
Again, I think this is an overstatement, per the §2.5.2 example of “Screw being ‘my best self’, I’m tired, I’m going to sleep”. But it’s certainly directionally true, and I was talking about that in §3.5.1. I think the actual rule is that, if planning / brainstorming is happening towards some goal G, then we imagine that “the homunculus wants G”, since the planning / brainstorming process in general pattern-matches to “wanting” (i.e., we can predict what will probably wind up happening without knowing how).
So that moves us to the question: “if planning / brainstorming is happening towards some goal G, then why do we conclude that S(G) is positive valence, rather than concluding that G is positive valence?” For one thing, if G is negative-valence but S(G) is positive-valence, then we’ll still do the planning / brainstorming, we just focus our attention on S(G) rather than G during that process. That’s my example above of “I really wanted and intended to step into the ice-cold shower, but when I got there, man, I just couldn’t.” Relatedly, if the brainstorming process involves self-reflective thoughts, then that enables better brainstorming, for example involving attention-control strategies, making deals with yourself, etc. (more in Post 8). And another part of the answer is the refrigerator-light illusion, as mentioned in §3.5.1 (and see also the edge-case of “impulsive planning” in §3.5.2).
Does that help?
I'd guess that there was evolutionary pressure for a self-model/homunculus to seem more pro-social than the overall behavior (and thoughts) of the human might imply, so I guess there might be some particular programming from evolution in that direction. I don't know exactly what it might look like though. I also wouldn't be shocked if it's mostly just that all the non-myopic desires are pretty pro-social and the self-model's values get straightened out in a way where the myopic desires end up dropped because that would be incoherent. Would be interested in hearing your model on my questions above.
This is a nitpick, but I think you’re using the word “pro-social” when you mean something more like “doing socially-endorsed things”. For example, if a bully is beating up a nerd, he’s impressing his (bully) friends, and he’s acting from social motivations, and he’s taking pride in his work, and he’s improving his self-image and popularity, but most people wouldn’t call bullying “pro-social behavior”, right?
Anyway, I think there’s an innate drive to impress the people who you like in turn. I’ve been calling it the drive to feel liked / admired. It is certainly there for evolutionary reasons, and I think that it’s very strong (in most people, definitely not everyone), and causes a substantial share of ego-syntonic desires, without people realizing it. It has strong self-reflective associations, in that “what the people I like would think of me” centrally involves “me” and what I’m doing, both right now and in general. It’s sufficiently strong that there tends to be a lot of overlap between “the version of myself that I would want others to see, especially whom I respect in turn” versus “the version of myself that I like best all things considered”.
I think that’s similar to what you’re talking about, right?
I think “vitalistic force” is a better term for describing what it intuitively seems to be, and “inherent unpredictability” is a better term for describing what’s happening under the hood. In this case I thought the former was a better label.
For example, last month, I had to watch the vet put down my pet dog. His transition from living to corpse was fast and stark. If you asked me to describe what happened, I would say “my dog’s animation / agency / vitality / life-force / whatever seemed to evaporate away”, or something like that. I certainly wouldn’t say “well, 10 seconds ago there seemed to be inherent unpredictability in this body, and now it seems like there isn’t”. ¯\_(ツ)_/¯
Still, I appreciate the comment, I’ll keep it in mind in case I think of some way to make things clearer.
I definitely think that the human brain has innate evolved mechanisms related to social behavior in general, and to caring about (certain) other people’s welfare in particular.
And I agree that the evolutionary pressures explaining why those mechanisms exist are generally related to the kinds of things that Robert Trivers and other evolutionary psychologists talk about.
This post isn’t about that. Instead it’s about what those evolved mechanisms are, i.e. how they work in the brain.
Does that help?
…But I do want to push back against a strain of thought within evolutionary psychology where they say “there was an evolutionary pressure for the human brain to do X, and therefore the human brain does X”. I think this fails to appreciate the nature of the constraints that the brain operates under. There can be evolutionary pressure for the brain to do something, but there’s no way for the brain to do it, so it doesn’t happen, or the brain does something kinda like that but with incidental side-effects or whatever.
As an example, imagine if I said: “Here’s the source code for training an image-classifier ConvNet from random initialization using uncontrolled external training data. Can you please edit this source code so that the trained model winds up confused about the shape of Toyota Camry tires specifically?” The answer is: “Nope. Sorry. There is no possible edit I can make to this PyTorch source code such that that will happen.” You see what I mean? I think this kind of thing happens in the brain a lot. I talk about it more specifically here. More of my opinions about evolutionary psychology here and here.
Sorry for oversimplifying your views, thanks for clarifying. :)
Here’s a part I especially disagree with:
Over time, in societies with well-functioning social and legal systems, most people learn that hurting other people doesn't actually help them selfishly. This causes them to adopt a general presumption against committing violence, theft, and other anti-social acts themselves, as a general principle. This general principle seems to be internalized in most people's minds as not merely "it is not in your selfish interest to hurt other people" but rather "it is morally wrong to hurt other people". In other words, people internalize their presumption as a moral principle, rather than as a purely practical principle. This is what prevents people from stabbing each other in the backs immediately once the environment changes.
Just to be clear, I imagine we’ll both agree that if some behavior is always a good idea, it can turn into an unthinking habit. For example, today I didn’t take all the cash out of my wallet and shred it—not because I considered that idea and decided that it’s a bad idea, but rather because it never crossed my mind to do that in the first place. Ditto with my (non)-decision to not plan a coup this morning. But that’s very fragile (it relies on ideas not crossing my mind), and different from what you’re talking about.
My belief is: Neurotypical people have an innate drive to notice, internalize, endorse, and take pride in following social norms, especially behaviors that they imagine would impress the people whom they like and admire in turn. (And I have ideas about how this works in the brain! I think it’s mainly related to what I call the “drive to be liked / admired”, general discussion here, more neuroscience details coming soon I hope.)
The object-level content of these norms is different in different cultures and subcultures and times, for sure. But the special way that we relate to these norms has an innate aspect; it’s not just a logical consequence of existing and having goals etc. How do I know? Well, the hypothesis “if X is generally a good idea, then we’ll internalize X and consider not-X to be dreadfully wrong and condemnable” is easily falsified by considering any other aspect of life that doesn’t involve what other people will think of you. It’s usually a good idea to wear shoes that are comfortable, rather than too small. It’s usually a good idea to use a bookmark instead of losing your place every time you put your book down. It’s usually a good idea to sleep on your bed instead of on the floor next to it. Etc. But we just think of all those things as good ideas, not moral rules; and relatedly, if the situation changes such that those things become bad ideas after all for whatever reason, we’ll immediately stop doing them with no hesitation. (If this particular book is too fragile for me to use a bookmark, then that’s fine, I won’t use a bookmark, no worries!)
those moral principles encode facts about what type of conduct happens to be useful in the real world for achieving our largely selfish objectives
I’m not sure what “largely” means here. I hope we can agree that our objectives are selfish in some ways and unselfish in other ways.
Parents generally like their children, above and beyond the fact that their children might give them yummy food and shelter in old age. People generally form friendships, and want their friends to not get tortured, above and beyond the fact that having their friends not get tortured could lead to more yummy food and shelter later on. Etc. I do really think both of those examples centrally involve evolved innate drives. If we have innate drives to eat yummy food and avoid pain, why can’t we also have innate drives to care for children? Mice have innate drives to care for children—it’s really obvious, there are particular hormones and stereotyped cell groups in their hypothalamus and so on. Why not suppose that humans have such innate drives too? Likewise, mice have innate drives related to enjoying the company of conspecifics and conversely getting lonely without such company. Why not suppose that humans have such innate drives too?
I don't think I agree with the "but only one thought can be there at a time" part.
I’m probably just defining “thought” more broadly than you. The cortex has many areas. The auditory parts can be doing some auditory thing, and simultaneously the motor parts can be doing some motor thing, and all that together constitutes (what I call) a “thought”.
I don’t think anything in the series really hinges on the details here. “Conscious awareness” is not conceptualized as some atomic concept where there’s nothing else to say about it. If you ask people to describe their conscious awareness, they can go on and on for hours. My claim is that: when they go on and on for hours describing “conscious awareness”, all the rich details that they’re describing can be mapped onto accurate claims about properties of the cortex and its activation states. (E.g. the next part of this comment.)
I think each of the workspaces has its own short memory which spans maybe like 2 seconds.
I agree that, if some part of the cortex is in activation state A at time T, those particular neurons (that constitute A) will get less and less active over the course of a second or two, such that at time T+1 second, it’s still possible for other parts of the cortex to reactivate activation state A via an appropriate query. I don’t think all traces of A immediately disappear entirely every time a new activation state appears.
Again, I think the capability of the cortex to have more than one incompatible generative models active simultaneously is extremely limited, and that for most purposes we should think of it as only having a single (MAP estimate) generative model active. But the capability to track multiple incompatible models simultaneously does exist, to some extent. I think this slow-fading thing is one application of that capability. Another is the time-extended probabilistic inference thing that I talked about in §2.3.
OK, my theory is:
- (A) There’s a thing where people act kind towards other people because it’s in their self-interest to act kind—acting kind will ultimately lead to eating yummier food, avoiding pain, and so on. In everyday life, we tend to associate this with flattery, sucking up, deception, insincerity, etc., and we view it with great skepticism, because we recognize (correctly) that such a person will act kind but then turn right around and stab you in the back as soon as the situation changes.
- (B) There’s a separate thing where people act kind towards other people because there’s some innate drive / primary reward / social instinct closely related to acting kind towards other people, e.g. feeling that the other person’s happiness is its own intrinsic reward. In everyday life, we view this thing very positively, because we recognize that such a person won’t stab you in the back when the situation changes.
I keep trying to pattern-match what you’re saying to:
- (C) [which I don’t believe in] This is a third category of situations where people are kind. Like (A), it ultimately stems from self-interest. But like (B), it does not entail the person stabbing you in the back as soon as the situation changes (such that stabbing you in the back is in their self-interest). And the way that works is over-generalization. In this story, the person finds that it’s in their self-interest to act kind, and over-generalizes this habit to act kind even in situations where it’s not in their self-interest.
And then I was saying that that kind of over-generalization story proves too much, because it would suggest that I would retain my childhood habit of not-driving-cars, and my childhood habit of saying that my street address is 18 Main St., etc. And likewise, it would say that I would continue to wear winter coats when I travel to the tropics, and that if somebody puts a toy train on my plate at lunchtime I would just go right ahead and eat it, etc. We adults are not so stupid as to over-generalize like that. We learn to adapt our behavior to the situation, and to anticipate relevant consequences.
But maybe that’s not what you’re arguing? I’m still kinda confused. You wrote “But across almost all environments, you get positive feedback from being nice to people and thus feel or predict positive valence about these.” I want to translate that as: “All this talk of stabbing people in the back is irrelevant, because there is practically never a situation where it’s in somebody’s self-interest to act unkind and stab someone in the back. So (A) is really just fine!” I don’t think you’d endorse that, right? But it is a possible position—I tend to associate it with @Matthew Barnett. I agree that we should all keep in mind that it’s very possible for people to act kind for self-interested reasons. But I strongly don’t believe that (A) is sufficient for Safe & Beneficial AGI. But I think that you’re already in agreement with me about that, right?
I was under the impression that “illusions of free will” was a standard term in the literature, but I just double-checked and I guess I was wrong, it just happened to be in one paper I read. Oops. So I guess I’m entitled to use whatever term seems best to me.
I mildly disagree that it’s unrelated to “free will”, but I agree that all things considered your suggestion “illusions of intentionality” is a bit better. I’m changing it, thanks!
I’m not too sure what you’re arguing.
I think we agree that motivations need to ground out directly or indirectly with “primary rewards” from innate drives (pain is bad, eating-when-hungry is good, etc., other things equal). (Right?)
And then your comment kinda sounds like you’re making the following argument:
There’s no need to posit the existence of an innate drive / primary reward that ever makes it intrinsically rewarding to be nice to people, because “you get positive feedback from being nice to people”, i.e. you will notice from experience that “being nice to people” will tend to lead to (non-social) primary rewards like eating-when-hungry, avoiding pain, etc., so the learning algorithm in your brain will sculpt you to have good feelings around being nice to people.
If that’s what you’re trying to say, then I strongly disagree and I’m happy to chat about that … but I was under quite a strong impression that that’s not what you believe! Right?
I thought that you believed that there is a primary reward / innate drive that makes it feel intrinsically rewarding for adults to be nice (under certain circumstances); if so, why bring up childhood at all?
Sorry if I’m confused :)