Comments
agents that have preferences about the state of the world in the distant future
What are these preferences? For biological agents, these preferences are grounded in some mechanism - what you call the Steering System - that evaluates "desirable states" of the world in some more or less directly measurable way (grounded in perception via the senses) and derives a signal of how desirable the state is, which the brain is optimizing for. For ML models, the mechanism is somewhat different, but there is also an input to the training algorithm that determines how "good" the output is. This signal is called reward, and it drives the system toward outputs that lead to states of high reward. But the path there depends on the specific optimization method, and the algorithm has to navigate such a complex loss landscape that it can get stuck in areas of the search space that correspond to imperfect models for a very long time, if not forever. These imperfect models can be off in significant ways, and that's why it may be useful to say that Reward is not the optimization target.
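To make the "stuck with an imperfect model" point concrete, here is a minimal toy sketch (the loss function is made up for illustration, not tied to any particular training setup): plain gradient descent on a non-convex loss lands in whichever basin it starts in, so the final parameters are shaped by the training signal without globally optimizing it.

```python
# Toy illustration (made-up loss, no specific ML setup): gradient descent on a
# non-convex loss landscape settles in whichever basin it starts in, so the
# final "model" reflects the training signal without globally optimizing it.

def loss(x):
    # Two basins: a shallow minimum near x = +2 and a deeper one near x = -2.
    return (x**2 - 4)**2 + 2 * x

def grad(x, eps=1e-5):
    # Numerical gradient via central differences.
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

for start in (3.0, -3.0):
    x = descend(start)
    print(f"start={start:+.1f} -> x={x:+.3f}, loss={loss(x):.3f}")
# The two runs end in different minima with different loss values: depending on
# initialization, the optimizer settles for an "imperfect model" indefinitely.
```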
The connection to Intuitive Self-Models is that even though the internal models of an LLM may be very different from human self-models, I think it is still quite plausible that LLMs and other models form models of the self. Such models are instrumentally convergent. Humans talk about the self. The LLM does things that match these patterns. Maybe the underlying process in humans that gives rise to this is different, but humans learning about the self can't know the actual process either. And in the same way, the approximate model the LLM forms is not maximizing the reward signal and can be quite far from doing so, as long as it is useful (in the sense of having higher reward than other such models/parameter combinations).
I think of my toenail as “part of myself”, but I’m happy to clip it.
Sure, the (body of the) self can include parts that can be cut/destroyed without that "causing harm" but instead having an overall positive effect. The AI in a compute center would in analogy also consider decommissioning failed hardware. And when defining humanity, we do have to be careful what we mean when these "parts" could be humans.
About conjoined twins and the self:
Krista and Tatiana Hogan (Wikipedia) are healthy, functional conjoined craniopagus twins who are joined at the head and share parts of the brain - their thalami are joined via a thalamic bridge: They can report on perceptions of the other and share affect.
I couldn't find scientific papers that studied their brain function rigorously, but the paper A Case of Shared Consciousness looks at evidence from documentaries and discusses it. Here are some observational details:
Each is capable of reporting on inputs presented to the other twin’s body. For example, while her own eyes are covered, Tatiana is able to report on visual inputs to both of Krista’s eyes. Meanwhile, Krista can report on inputs to one of Tatiana’s eyes. Krista is able to report and experience distaste towards food that Tatiana is eating (the reverse has not been reported, but may also be true). An often repeated anecdote is that while Tatiana enjoys ketchup on her food, Krista will try to prevent her eating it. Both twins can also detect when and where the other twin’s body is being touched, and their mother reports that they find this easier than visual stimuli.
fMRI imaging revealed that Tatiana’s brain ‘processes signals’ from her own right leg, both her arms, and Krista’s right arm (the arm on the side where they connect). Meanwhile Krista’s brain processes signals from her own left arm, both her own legs and Tatiana’s left leg (again on the side where they connect). Each twin is able to voluntarily move each of the limbs corresponding to these signals.
The twins are also capable of voluntary bodily control for all the limbs within their ordinary body plans. As their mother Felicia puts it, “they can choose when they want to do it, and when they don’t want to do it.”
The twins also demonstrate a common receptivity to pain. When one twin’s body is harmed, both twins cry.
The twins report that they talk to each other in their heads. This had previously been suspected by family members due to signs of apparent collusion without verbalisation.
The popular article How Conjoined Twins Are Making Scientists Question the Concept of Self contains many additional interesting bits:
when a pacifier was placed in one infant’s mouth, the other would stop crying.
About the self:
Perhaps the experience of being a person locked inside a bag of skin and bone—with that single, definable self looking out through your eyes—is not natural or given, but merely the result of a changeable, mechanical arrangement in the brain. Perhaps the barriers of selfhood are arbitrary, bendable. This is what the Hogan twins’ experience suggests. Their conjoined lives hint at myriad permeations of the bodily self.
About qualia:
Tatiana senses the greeniness of Krista’s experience all the time. “I hate it!” she cries out, when Krista tastes some spinach dip.
(found via FB comment)
A much smaller subset was also published here, but does include documents:
Instrumental power-seeking might be less dangerous if the self-model of the agent is large and includes individual humans, groups, or even all of humanity and if we can reliably shape it that way.
It is natural for humans to form a self-model that is bounded by the body, though it is also common for the self-model to include only the brain or the mind, and there are other self-models. See also Intuitive Self-Models.
It is not clear what the self-model of an LLM agent would be. It could be
- the temporary state of the execution of the model (or models),
- the persistently running model and its memory state,
- the compute resources (CPU/GPU/RAM) allocated to run the model and its collection of support programs,
- the physical compute resources in some compute center(s),
- the compute center as an organizational structure that includes the staff to maintain and operate not only the machines but also the formal organization (after all, without that, the machines will eventually fail), or
- ditto, but including all the utilities and suppliers needed to continue operating it.
There is not as clear a physical boundary as in the human case. But even in the human case, esp. babies depend on caregivers to a large degree.
There are indications that we can shape the self-model of LLMs: Self-Other Overlap: A Neglected Approach to AI Alignment
This sounds related to my complaint about the YUDKOWSKY + WOLFRAM ON AI RISK debate:
I wish there had been some effort to quantify @stephen_wolfram's "pockets of irreducibility" (section 1.2 & 4.2) because if we can prove that there aren't many or they are hard to find & exploit by ASI, then the risk might be lower.
I got this tweet wrong. I meant: if pockets of irreducibility are common and non-pockets are rare and hard to find, then the risk from superhuman AI might be lower. I think Stephen Wolfram's intuition has merit but needs more analysis to be convincing.
There are two parts to the packaging that you have mentioned:
- optimizing transport (not breaking the TV) is practical and involves everything but the receiver
- enhancing reception (nice present wrapping) is cultural and involves the receiver(s)
Law of equal (or not so equal) opposite advice: There are some - probably few - flaws that you can keep, because they are small and not worth the effort to fix, or because they make you more lovable and unique.
Example:
- I'm a very picky eater. No sauces, no creams, no spicy foods. Lots of things excluded. It limits what I can eat, and I always have to explain.
But don't presume any flaw you are attached to falls into this category. I'm also not strongly convinced of this.
a lot of the current human race spends a lot of time worrying - which I think probably has the same brainstorming dynamic and shares mechanisms with the positively oriented brainstorming. I don't know how to explain this; I think the avoidance of bad outcomes being a good outcome could do this work, but that's not how worrying feels - it feels like my thoughts are drawn toward potential bad outcomes even when I have no idea how to avoid them yet.
If we were not able to think well about potentially bad outcomes, that would be a problem, as thinking clearly about them is, hopefully, what avoids them. But the question is a good one. My first intuition was that maybe the importance of an outcome - in both directions, good and bad - is relevant.
I like the examples from 8.4.2:
- Note the difference between saying (A) “the idea of going to the zoo is positive-valence, a.k.a. motivating”, versus (B) “I want to go to the zoo”. [...]
- Note the difference between saying (A) “the idea of closing the window popped into awareness”, versus (B) “I had the idea to close the window”. Since (B) involves the homunculus as a cause of new thoughts, it’s forbidden in my framework.
I think it could be an interesting mental practice to rephrase inner speech involving "I" in this way. I have been doing this for a while now. It started toward the end of my last meditation retreat, when I switched to a non-CISM (or should I say "there was a switch in the thoughts about self-representation"?). Using "I" in mental verbalization felt like a syntax error, and other phrasings, like the ones you are suggesting here, felt more natural. Interestingly, it still makes sense to use "I" in conversations to refer to me (the speaker). I think that is part of why the CISM is so natural: It uses the same element in internal and external verbalizations[1].
Pondering your examples, I think I would render them differently. Instead of: "I want to go to the zoo," it could be: "there is a desire to go to the zoo." Though I guess if "desire to" stands for "positive-valence thought about," it is very close to your "the idea of going to the zoo is positive-valence."
In practice, the thoughts would be smaller, more like "there is [a sound][2]," "there is a memory of [an animal]," "there is a memory of [an episode from a zoo visit]," "there is a desire to [experience zoo impressions]," "there is a thought of [planning]." The latter gets complicated. The thought of planning could be positive valence (because plans often lead to desirable outcomes) or the planning is instrumentally useful to get the zoo impressions (which themselves may be associated with desirable sights and smells), or the planning can be aversive (because effortful), but still not strong enough to displace the desirable zoo visit.
For an experienced meditator, the fragments that can be noticed can be even smaller - or maybe more precursor-like. This distinction is easier to see with a quiet mind, where, before a thought fully occupies attention, glimmers of thoughts may bubble up[3]. This is related to noticing that attention is shifting. The everyday version of that happens when you notice that you got distracted by something. The subtler form is noticing small shifts during your regular thinking (e.g., I just noticed my attention shifting to some itch, without that really interrupting my writing flow). But I'm not sure how much of that is really a sense of attention vs. a retroactive interpretation of the thoughts. Maybe a more competent meditator can comment.
- ^
And now I wonder whether the phonological loop, or whatever is responsible for language-like thoughts, maybe subvocalizations, is what makes the CISM the default model.
- ^
[brackets indicate concepts that are described by words, not the words themselves]
- ^
The question is though, what part notices the noticing. Some thought of [noticing something] must be sufficiently stable and active to do so.
I think your explanation in section 8.5.2 resolves our disagreement nicely. You refer to S(X) thoughts that "spawn up" successive thoughts that eventually lead to X (I'd say X') actions shortly after (or much later), while I was referring to S(X) that cannot give rise to X immediately. I think the difference was that you are more lenient with what X can be, such that S(X) can be about an X that is happening much later, which wouldn't work in my model of thoughts.
Explicit (self-reflective) desire
Statement: “I want to be inside.”
Intuitive model underlying that statement: There’s a frame (§2.2.3) “X wants Y” (§3.3.4). This frame is being invoked, with X as the homunculus, and Y as the concept of “inside” as a location / environment.
How I describe what’s happening using my framework: There’s a systematic pattern (in this particular context), call it P, where self-reflective thoughts concerning the inside, like “myself being inside” or “myself going inside”, tend to trigger positive valence. That positive valence is why such thoughts arise in the first place, and it’s also why those thoughts tend to lead to actual going-inside behavior.
In my framework, that’s really the whole story. There’s this pattern P. And we can talk about the upstream causes of P—something involving innate drives and learned heuristics in the brain. And we can likewise talk about the downstream effects of P—P tends to spawn behaviors like going inside, brainstorming how to get inside, etc. But “what’s really going on” (in the “territory” of my brain algorithm) is a story about the pattern P, not about the homunculus. The homunculus only arises secondarily, as the way that I perceive the pattern P (in the “map” of my intuitive self-model).
As I commented on Are big brains for processing sensory input?, I predict that the brain regions of a whale or orca responsible for spatiotemporal learning and memory are a big part of their encephalization.
I'm not disagreeing with this assessment. The author has an agenda, but I don't think it's hidden in any way. It is mostly word thinking and social association. But that's how the opposition works!
I believe this has been done in Google's Multilingual Neural Machine Translation (GNMT) system, which enables zero-shot translation (translating between language pairs without direct training examples). The system leverages shared representations across languages, allowing the model to infer translations for unseen language pairs.
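As a rough sketch of the mechanism (illustrative only; the function and examples below are made up, not Google's code): the multilingual system marks each training example with a token naming the target language, so one shared model learns all directions at once, and a zero-shot direction is simply requested with the same token trick.

```python
# Illustrative sketch of the multilingual-NMT input convention (made-up code,
# not Google's): the source sentence is prefixed with a token naming the
# *target* language, and one shared model is trained on all such pairs.
# Zero-shot translation then just means requesting an unseen direction with
# the same prefix, relying on the shared representation.

def format_example(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token to the source sentence."""
    return f"<2{tgt_lang}> {src_sentence}"

# Directions present in training (hypothetical examples):
train_pairs = [
    (format_example("How are you?", "es"), "¿Cómo estás?"),               # en -> es
    (format_example("Où est la gare ?", "en"), "Where is the station?"),  # fr -> en
]

# A direction never seen in training (fr -> es), requested the same way:
zero_shot_input = format_example("Où est la gare ?", "es")
print(zero_shot_input)  # "<2es> Où est la gare ?"
```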
The above link posted is a lengthy and relatively well-sourced, if biased, post about Scott Alexander's writing related to human biodiversity (HBD). The author is very clearly opposed to HBD. I think it is a decent read if you want to understand that position.
Thanks. I already got in touch with Masaharu Mizumoto.
Congrats again for the sequence! It all fits together nicely.
While it makes sense to exclude hallucinogenic drugs and seizures, at least hallucinogenic drugs seem to fit into the pattern if I understand the effect correctly.
Auditory hallucinations, top-down processing and language perception - this paper says that imbalances in top-down cortical regulation are responsible for auditory hallucinations:
Participants who reported AH in the week preceding the test had a higher false alarm rate in their auditory perception compared with those without such (recent) experiences.
And this page Models of psychedelic drug action: modulation of cortical-subcortical circuits says that hallucinogenic drugs lead to such imbalances. So it is plausibly the same mechanism.
- Scott Alexander for psychiatry and drugs and many other topics
- Paul Graham for startups specifically, but his essays cover a much wider space
- Scott Adams for persuasion, humor, and recently a lot of political commentary - not neutral; he has his own agendas
- Robin Hanson - Economics, esp. long-term, very much out-of-distribution thinking
- Zvi Mowshowitz for AI news (and some other research-heavy topics; previously COVID-19)
I second Patrick McKenzie.
I really like this one! I just wish you had split it into two posts, one for Philosophy and one for Wisdom.
Finally someone gets Philosophy! Though admittedly, most of philosophy is not about philosophy these days. It is a tradition of knowledge that has lost much of its footing (see On the Loss and Preservation of Knowledge). But that's true of much of science and shouldn't lead us to ignore this core function of Philosophy: Study confusing questions in the absence of guiding structure.
The case of ethics as a field of Philosophy is interesting because it has been a part of it for so long. It suggests that people have tried and repeatedly failed to find a working ontology and make Ethics into its own paradigmatic field. I think this is so because ethics is genuinely difficult. Partly, because Intuitive Self-Models are so stable and useful but not veridical. But I think we will eventually succeed and be able to "Calculate the Morality Matrix."
Hi, is there a way to get people in touch with a project or project lead? For example, I'd like to get in touch with Masaharu Mizumoto because iVAIS sounds related to the aintelope project.
The post was likely downvoted because it conflicts with principles of empathy, cooperation, and intellectual rigor. Defending bullying, even provocatively, clashes with commonly held beliefs. The zero-sum framing of status is overly simplistic, ignoring positive-sum approaches. The provocative style comes off as antagonistic. Reframing the argument around prosocial accountability might get more positive responses.
Thanks. It doesn't help because we already agreed on these points.
We both understand that there is a physical process in the brain - neurons firing etc. - as you describe in 3.3.6, that gives rise to a) S(A), b) A, and c) the precursors to both, as measured by Libet and others.
We both know that people's self-reports are unreliable and informed by their intuitive self-models. To illustrate that I understand 2.3, let me give an example: My son has figured out that people hear what they expect to hear and experimented with leaving out fragments of words or sentences, enjoying how people never noticed anything was off (example: "ood morning"). Here, the missing part doesn't make it into people's awareness even though the whole sentence very well does.
I'm not asserting that there is nothing upstream of S(A) that is causing it. I'm asserting that an individual S(A) is not causing A. I'm asserting so because, timing-wise, it can't, and, equivalently, because there is no neurological action path from S(A) to A. The only relation between S(A) and A is that S(A) and A co-occurring has been statistically positive valence in the past. And this co-occurrence is facilitated by a common precursor. But saying S(A) is causing A is as right or wrong as saying A is causing S(A).
I mean this (my summary of the Libet experiments and their replications):
- Brain activity detectable with EEG (the Readiness Potential) begins between 350 ms and multiple seconds (depending on experiment and measurement resolution) before the person consciously feels the intention to act (voluntary motor movement).
- Subjects report becoming aware of their intention to act (via clock tracking) about 200 ms before the action itself (e.g., pressing a button). The 200 ms seems relatively fixed, but cognitive load can delay it.
To give a specific quote:
Matsuhashi and Hallett: Our result suggests that the perception of intention rises through multiple levels of awareness, starting just after the brain initiates movement.
[...]
1. The first detected event in most subjects was the onset of BP. They were not aware of the movement genesis at this time, even if they were alerted by tones.
2. As the movement genesis progressed, the awareness state rose higher and after the T time, if the subjects were alerted, they could consciously access awareness of their movement genesis as intention. The late BP began within this period.
3. The awareness state rose even higher as the process went on, and at the W time it reached the level of meta-awareness without being probed. In Libet et al’s clock task, subjects could memorize the clock position at this time.
4. Shortly after that, the movement genesis reached its final point, after which the subjects could not veto the movement any more (P time).[...]
We studied the immediate intention directly preceding the action. We think it best to understand movement genesis and intention as separate phenomena, both measurable. Movement genesis begins at a level beyond awareness and over time gradually becomes accessible to consciousness as the perception of intention.
Now, I think you'd say that what they measured wasn't S(A) but something else that is causally related, but then you are moving farther away from patterns we can observe in the brain. And your theory still has to explain the subclass of those S(A) that they did measure. The participants apparently thought these to be their decisions S(A) about their actions A.
We’re definitely talking past each other somehow.
I guess this will only stop when we have made our thoughts clear enough for an implementation that allows us to inspect the system for S(A) and A. Which is OK.
At least this has helped clarify that you think of S(A) as (often) preceding A by a lot, which wasn't clear to me. I think this complicates the analysis because of where to draw the line. Would it count if I imagine throwing the ball one day (S(A)) but execute it during the game the next day, as I intended?
What do you make of the Libet experiments?
There is just one problem that Libet discovered: There is no time for S(A) to cause A.
My favorite example is throwing a ball: A is the releasing of the ball at the right moment to hit a target. This requires millisecond precision of release. The S(A) is precisely timed to coincide with the release. It feels like you are releasing the ball at the moment your hand releases it. But that can't be true, because the signal from the brain alone takes longer than the duration of a thought. If your theory were right, you would feel the intention to release the ball and a moment later would have the sensation of the result happening.
Now, one solution around this would be to time-tag thoughts and reorder them afterwards, maybe in memory - a bit like out-of-order execution in CPUs handles parallel execution of sequential instructions. But I'm not sure that is what is going on or that you think it is.
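To make the out-of-order analogy concrete, here is a toy sketch (illustrative only; the events and millisecond values are invented, not measurements): each event carries a time tag, and the "committed" sequence is rebuilt by sorting on those tags, so the remembered order can differ from the order in which the events became reportable.

```python
# Toy version of the out-of-order-execution analogy (invented numbers, not a
# model of the brain): events become reportable in one order, but each carries
# a time tag, and the remembered sequence is rebuilt by sorting on those tags.

# (event, time it is tagged as occurring [ms], time it becomes reportable [ms])
events = [
    ("motor command issued",      -150, -20),
    ("ball leaves hand",             0,  60),
    ("felt intention to release",  -10, 120),  # constructed late, tagged early
]

reportable_order = [name for name, _, _ in sorted(events, key=lambda e: e[2])]
remembered_order = [name for name, _, _ in sorted(events, key=lambda e: e[1])]

print("order of becoming reportable:", reportable_order)
print("remembered (time-tagged) order:", remembered_order)
```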
So, my conclusion is that there is a common cause of both S(A) and A.
And my interpretation of Daniel Ingram's comments is different from yours.
In Mind and Body, the earliest insight stage, those who know what to look for and how to leverage this way of perceiving reality will take the opportunity to notice the intention to breathe that precedes the breath, the intention to move the foot that precedes the foot moving, the intention to think a thought that precedes the thinking of the thought, and even the intention to move attention that precedes attention moving.
These "intentions to think/do" that Ingraham refers to are not things untrained people can notice. There are things in the mind that precede the S(A) and A and cause them but people normally can't notice them and thus can't be S(A). I say these precursors are the same things picked up in the Libet experiments and neurological measurements.
I think what you are looking for is this:
People used to ask me for writing advice. And I, in all earnestness, would say “Just transcribe your thoughts onto paper exactly like they sound in your head.” It turns out that doesn’t work for other people. Maybe it doesn’t work for me either, and it just feels like it does.
and
I’ve written a few hundred to a few thousand words pretty much every day for the past ten years.
But as I’ve said before, this has taken exactly zero willpower. It’s more that I can’t stop even if I want to. Part of that is probably that when I write, I feel really good about having expressed exactly what it was I meant to say. Lots of people read it, they comment, they praise me, I feel good, I’m encouraged to keep writing, and it’s exactly the same virtuous cycle as my brother got from his piano practice.
(the included link is also directly relevant)
You give the example of the door that is sometimes pushed open, but let me give alternative analogies:
- S(A): Forecaster: "The stock price of XYZ will rise tomorrow." A: XYZ's stock rises the next day.
- S(A): Drill sergeant: "There will be exercises at 14:00 hours." A: Military units start their exercises at the designated time.
- S(A): Live commentator: "The rocket is leaving the launch pad." A: A rocket launches from the ground.
Clearly, there is a reason for the co-occurrence, but it is not one causing the other. And it is useful to have the forecaster because making the prediction salient helps improve predictions. Making the drill time salient improves punctuality or routine or something. Not sure what the benefit of the rocket launch commentary is.
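A minimal simulation of the common-cause point (toy probabilities, nothing neuro-specific): a shared precursor P drives both an S-like event and an A-like event, so they co-occur reliably, yet suppressing S changes nothing about A.

```python
import random

random.seed(0)

# Toy common-cause model (invented probabilities): a precursor P sometimes
# fires; when it does, it tends to produce both the prediction-like event S
# and the action-like event A. S never feeds into A.
def trial(suppress_s=False):
    p = random.random() < 0.3                  # precursor fires
    s = (not suppress_s) and p and random.random() < 0.9
    a = p and random.random() < 0.9            # A depends only on P, never on S
    return s, a

def run(n=100_000, suppress_s=False):
    results = [trial(suppress_s) for _ in range(n)]
    return (sum(a for _, a in results) / n,        # P(A)
            sum(s and a for s, a in results) / n)  # P(S and A)

p_a, p_sa = run()
p_a_suppressed, _ = run(suppress_s=True)
print(f"P(A) = {p_a:.3f}, P(S and A) = {p_sa:.3f}")        # strong co-occurrence
print(f"P(A) with S suppressed = {p_a_suppressed:.3f}")    # unchanged: S does not cause A
```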
Otherwise I think we agree.
I want to comment on the interpretation of S(A) as an "intention" to do A.
Note that I'm coming back here from section 6. Awakening / Enlightenment / PNSE, so if somebody hasn't read that, this might be unclear.
Using the terminology above, A here is "the patterns of motor control and attention control outputs that would collectively make my muscles actually execute the standing-up action."
And S(A) is "the patterns of motor control and attention control outputs that would collectively make my muscles actually execute the standing-up action are in my awareness." Meaning a representation of "awareness" is active together with the container-relationship and a representation of A. (I am still very unsure about how "awareness" is learned and represented.)[1]
Referring to 2.6.2, I agree with this:
[S(A) and A] are obviously strongly associated with each other. They can activate simultaneously. And even if they don’t, each tends to bring the other to mind, such that the valence of one influences the valence of the other.
and
For any action A where S(A) has positive valence, there’s often a two-step temporal sequence: [S(A) ; A actually happens]
I agree that in this co-occurrence sense "S(X) often summons a follow-on thought of X." But it is not causing it, which is what "summon" might imply. This choice of word is maybe an indication of the uncertainty here.
Clearly, action A can happen without S(A) being present. In fact, actions are often more effectively executed if you don't think too hard about them[citation needed]. An S(A) is not required. Maybe S(A) and A co-occur often, but that doesn't imply causality. But, indeed, it would seem to be causal in the context of a homunculus model of action. Treating it as causal/vitalistic is predictive. The real reason is the co-occurrence of the thoughts, which can have a common cause, such as when the S(A) thought brings up additional associations that lead to higher-valence thoughts/actions later (e.g., chains of S(A), A, S(A)->S(B), B).
Thus, S(A) isn't really an "intention to do A" per se, but just what it says on the tin: "awareness of (expecting) A." I would say it is only an "intention to do A" if the thought S(A) also includes the concept of intention - which is a concept tied to the homunculus and an intuitive model of agency.
- ^
I am still very unsure about how "awareness" is learned and represented. Above it says
the cortex, which has a limited computational capacity that gets deployed serially [...] When this aspect of the brain algorithm is itself incorporated into a generative model via predictive (a.k.a. self-supervised) learning, it winds up represented as an “awareness” concept,
but this doesn't say how. The brain needs to observe something (sense, interoception) from which it can infer this. What pattern in which observations would that be? The serial processing is a property the brain can't observe unless there is some way to combine/compare past and present "thoughts." That's why I have long thought that there has to be a feedback path from the current thought back to the input (thoughts as observations). Such a connection is not present in the brain-like model, but it might not be the only way. Another way would be via memory. If a thought is remembered, then one way of implementing memory would be to provide a representation of the remembered thought as input. In any case, there must be a relation between successive thoughts, otherwise they couldn't influence each other.
It seems plausible that, in a sequence of events, awareness S(A) is related to a pattern of A having occurred previously in the sequence (or being expected to occur).
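To make the "thoughts as observations" idea in this footnote concrete, here is a toy recurrent loop (a sketch of the proposed wiring only, not a claim about the brain; all dimensions and weights are arbitrary): the previous "thought" vector is appended to the next step's input, which is the minimal structure needed for successive thoughts to influence each other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of the footnote's proposal (arbitrary sizes and weights, not a
# brain model): the previous "thought" vector is fed back as part of the next
# input, so successive thoughts can depend on each other - the minimal relation
# argued for above, whether implemented as direct feedback or via memory.
obs_dim, thought_dim = 4, 3
W = rng.normal(size=(thought_dim, obs_dim + thought_dim)) * 0.3

def step(observation, prev_thought):
    x = np.concatenate([observation, prev_thought])  # the thought re-enters as input
    return np.tanh(W @ x)                            # next "thought"

thought = np.zeros(thought_dim)
for t in range(5):
    observation = rng.normal(size=obs_dim)
    thought = step(observation, thought)
    print(t, np.round(thought, 3))
```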
After reading all the 2.6 and 3.3 sections again, I think the answer to why the homunculus is attention-grabbing is that it involves "continuous self-surprise" in the same way an animate object (mouse...) does. A surprise that is present as a proprioceptive signal or felt sense. With PNSE, your brain has learned to predict the internal S(X) mental objects and this signal well enough that the remaining surprisingness of the mental processes would be more like the gears contraption from 3.3.2, where "the surprising feeling that I feel would be explained away by a different ingredient in my intuitive model of the situation—namely, my own unfamiliarity with all the gears inside the contraption and how they fit together." And as such, it is easier to tune out: The mind is doing its usual thing. Process as usual.
You are not alone. Paul Graham has been writing essays for a long time, and he is revising and rewriting a lot too. Here you can see him write one of his essays as an edit replay.
Also: "only one sentence in the final version is the same in the first draft."
Also:
The drafts of the essay I published today. This history is unusually messy. There's a gap while I went to California to meet the current YC batch. Then while I was there I heard the talk that made me write "Founder Mode." Plus I started over twice, most recently 4 days ago.
I think children can sleep in most places as long as they feel safe. Some parents seem to think that their children can only sleep in tightly controlled environments: quiet, dark, comfy. But I think that is often a result of training. If the children never sleep in any other environment, how can they suddenly feel safe there? Or if the parents or other people are stressed in the other environments, children will notice that something is off, not feel safe, and not sleep. But a place with lots of friendly, happy people seems quite safe to me.
I found a photo of two of my kids sleeping "on stage." This table was right next to the stage at my sister's wedding, and the music was not quiet, for sure.
I do think that there are mechanisms in the human brain that make prosocial behavior more intrinsically rewarding, such as the mechanisms you pointed out in the Valence sequence.
But I also notice that in the right kind of environments, "being nice to people" may predict "people being nice to you" (in a primary reward sense) to a higher degree than might be intuitive.
I don't think that's enough because you still need to ensure that the environment is sufficiently likely to begin with, with mechanisms such as rewarding smiles, touch inclinations, infant care instincts or whatever.
I think this story of how human empathy works may plausibly involve both social instincts as well as the self-interested indirect reward in very social environments.
I think the point we agree on is
habits that last through adulthood [because] the adult independently assesses those habits as being more appealing than alternatives,
I think that the habit of being nice to people is empathy.
So by the same token, when I was a little kid, yes it was in my self-interest (to some extent) for my parents to be healthy and happy. But that stopped being true as soon as I was financially independent. Why assume that people would permanently internalize that, when they fail to permanently internalize so many other aspects of childhood?
I'm not claiming that they "permanently internalize" it, but that they correctly (well, modulo mistakes) predict that it is in their interest. You started driving a car because you correctly predicted that the situation/environment had changed. But across almost all environments, you get positive feedback from being nice to people and thus feel or predict positive valence about it.
Actually it’s worse than that—adolescents are notorious for not feeling motivated by the well-being of their parents, even while such well-being is still in their own narrow self-interest!! :-P
That depends on the type of well-being and your ability to predict it. And maybe other priorities get in the way during that age. And again, I'm not claiming unconditional goodness. The environment of young adults is clearly different from that of children, but it is comparable enough to predict positive value from being nice to your parents.
Actually, psychopaths prove this point: The anti-social behavior is "learned" in many cases during abusive childhood experiences, i.e., in environments where it was exactly not in their interest to be nice - because it didn't benefit them. And on the other side, psychopaths can, in many cases, function and show prosocial behaviors in stable environments with strong social feedback.
This also generalizes to the cultures example.
As an example of (2), a religious person raised in a religious community might stay religious by default. Until, that is, they move to the big city
I agree: In the city, many of their previous predictions of exactly which behaviors lead to positive feedback ("quoting the Bible") might be off, and they will quickly learn new behaviors. But being nice to people in general will still work. In fact, I claim, it tends to generalize even more, which is why people who have been around more varied communities tend to develop more generalized morality (higher Kegan levels).
I think the steelmanned version of beren's argument is
The potential for empathy is a natural consequence of learned reward models
That you indeed get for free. It will not get you far, as you have pointed out, because once you get more information, the model will learn to distinguish the cases precisely. And we know from observation that some mammals (specifically territorial ones) and most other animals do not show general empathy.
But there are multiple ways that empathy can be implemented with small additional circuitry. I think this is the part of beren's comment that you were referring to:
For instance, you could pass the RPE through to some other region to detect whether the empathy triggered for a friend or enemy and then return either positive or negative reward, so implementing either shared happiness or schadenfreude. Generally I think of this mechanism as a low level substrate on which you can build up a more complex repertoire of social emotions by doing reward shaping on these signals.
But it might even be possible that no additional circuitry is required if the environment is just right. Consider the case of a very social animal in an environment where individuals, esp. young ones, rarely can take care of themselves alone. In such an environment, there may be many situations where the well-being of others predicts your own well-being. For example, if you give something to the other (and that might just be a smile), that makes it more likely that you will be fed. This doesn't seem to necessarily require any extra circuits, though it might be more likely to bootstrap off some prior mechanisms, e.g., grooming or infant care.
This might not be stable because free-loading might evolve, but this is then secondary.
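A toy value-learning sketch of the previous point (invented probabilities, and note there is no dedicated empathy circuit anywhere in the code): if "the other is fed" raises the probability that I receive reward shortly after, a plain learned value estimate (a running average of subsequent reward) ends up assigning higher value to states where the other is doing well.

```python
import random

random.seed(1)

# Toy sketch (invented numbers, no dedicated empathy circuitry): in this
# environment, "other_fed" raises the chance that I receive reward soon.
# A plain running value estimate over the two states therefore learns a higher
# value for "other_fed" - the other's well-being becomes valuable purely
# because it predicts my own reward.
P_MY_REWARD = {"other_fed": 0.6, "other_hungry": 0.2}
values = {"other_fed": 0.0, "other_hungry": 0.0}
alpha = 0.05  # learning rate for the running estimate

for _ in range(20_000):
    state = random.choice(list(values))
    reward = 1.0 if random.random() < P_MY_REWARD[state] else 0.0
    values[state] += alpha * (reward - values[state])

print({s: round(v, 2) for s, v in values.items()})
# e.g. {'other_fed': 0.6, 'other_hungry': 0.2}
```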
I wonder which of these cases this comment of yours is:
consider “seeing someone get unexpectedly punched hard in the stomach”. That makes me cringe a bit, still, even as an adult.
Can you say more which concept you mean exactly?
Jeff could offer to receive such stories anonymously and repost them.
You refer to status as an attribute of a person, but now I'm wondering how the brain represents status. I wouldn't rule out the possibility of high status being the same thing as the willingness to let others control you.
You might want to have a look at the
The Collected Papers of Milton H. Erickson on Hypnosis Vol 1 - The Nature of Hypnosis and Suggestion
I read it some years ago and found it insightful, plausible, and fun to read, but couldn't wrap my mind around it forming a coherent theory. And from my recollection, many things in there confirm Johnstone and complement it, esp. the high-status aspects. There may be more.
Crowds are trance-inducing because the anonymity imposed by the crowd absolves you of the need to maintain your identity.
In a tight crowd, it is easiest to do what the crowd is doing, and there are attractors for what works in a crowd (e.g., speed of movement) such that the crowd's dynamic takes over.
Does LessWrong need link posts for astralcodexten?
Not in general, no.
Aren't LessWrong readers already pretty aware of Scott's substack?
I would be surprised if the overlap is > 50%
I'm linkposting it because I think this fits into a larger pattern of understanding cognition that will play an important role in AI safety and AI ethics.
Hi Newbie, what are your thoughts on it?
The advice in here might very well be of the "it seems obvious once you've read it" kind, but I think it's still useful
The problem is not that people don't know what to do. Just recently, I heard about a similar difficulty from esports players: They know what to do - farm gold regularly, kill enemies, keep map awareness, etc. It is just in the moment that the right action is elusive.
"Why didn't you retreat when you were low on health?"
"I knew I was low on health and had to retreat! But I thought the way to retreat was left (where more trouble turned up) and not right."
Feel free to take that as a metaphor for relationships if you want XD.
That's why I like the section about the Freakout Tree so much: It describes a common conflict pattern and provides a resolution approach worth imitating.
Explicitly asking “Hey, can I have the tree?” has saved our bacon more than once.
I have updated to it being a mix. It is not only being kept in check by others. There are benevolent rulers. Not all, and not reliably, but there seems to be potential.
Convergence. Humans and LLMs with deliberation do the same thing and end up making the same class of errors.
Just came across Harmonic mentioned on the AWS Science Blog. Sequoia Capital interview with the founders of Harmonic (their system, which generates Lean proofs, is SOTA on MiniF2F):
I would remove that last paragraph. It doesn't add to your point and gives the impression that you might have a specific agenda.
Have there been any followups or forks etc. of Botworld since it was created? It seemed very promising. There should be something.
I notice that o1's behavior (its cognitive process) looks suspiciously like human behaviors:
- Cognitive dissonance: o1 might fabricate or rationalize to maintain internal consistency in the face of conflicting data (which means there is inconsistency).
- Impression management/Self-serving bias: o1 may attempt to appear knowledgeable or competent, leading to overconfidence, because it is rewarded more for the look than for the content (which means the model is stronger than the feedback).
But why is this happening more when o1 can reason more than previous models? Shouldn't that give it more ways to catch its own deception?
No:
- Overconfidence in plausibility: With enhanced reasoning, o1 can generate more sophisticated explanations or justifications, even when incorrect. o1 "feels" more capable and thus might trust its own reasoning more, producing more confident errors ("feels" in the sense of expecting to be able to generate explanations that will be rewarded as good).
- Lack of ground truth: Advanced reasoning doesn't guarantee access to verification mechanisms. o1 is rewarded for producing convincing responses, not necessarily for ensuring accuracy. Better reasoning can increase the capability to "rationalize" rather than self-correct.
- Complexity in mistakes: Higher reasoning allows more complex thought processes, potentially leading to mistakes that are harder to identify or self-correct.
Most of this is analogous to how more intelligent people ("intellectuals") can generate elaborate, convincing—but incorrect—explanations that cannot be detected by less intelligent participants (who may still suspect something is off but can't prove it).
Look inside an LLM. Goodfire trained sparse autoencoders on Llama 3 8B and built a tool to work with edited versions of Llama by tuning features/concepts.
(I am loosely affiliated, another team at my current employer was involved in this)
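For readers unfamiliar with the technique, here is a generic miniature of what "tuning features" with a sparse autoencoder means (this is not Goodfire's tool or API; the weights here are random and untrained, so it only shows the structure): encode an activation vector into sparse features, scale one feature, decode, and substitute the edited activation back into the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic, untrained miniature of SAE-based feature steering (not Goodfire's
# tool or API): activations are encoded into a sparse feature vector, one
# feature is scaled, and the decoded activation replaces the original in the
# model's forward pass.
d_model, d_features = 8, 32
W_enc = rng.normal(size=(d_features, d_model)) * 0.2
W_dec = rng.normal(size=(d_model, d_features)) * 0.2

def encode(h):
    return np.maximum(W_enc @ h, 0.0)    # ReLU gives a sparse-ish feature vector

def decode(f):
    return W_dec @ f

def steer(h, feature_idx, scale):
    f = encode(h)
    f[feature_idx] *= scale              # turn a single learned concept up or down
    return decode(f)

h = rng.normal(size=d_model)             # stand-in for a residual-stream activation
h_edited = steer(h, feature_idx=3, scale=5.0)
print(np.round(h_edited - decode(encode(h)), 3))  # the effect of the edit alone
```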
…So that’s all that’s needed. If any system has both a capacity for endogenous action (motor control, attention control, etc.), and a generic predictive learning algorithm, that algorithm will be automatically incentivized to develop generative models about itself (both its physical self and its algorithmic self), in addition to (and connected to) models about the outside world.
Yes, and there are many different classes of such models. Most of them are boring, because the prediction of the effect of the agent on the environment is limited (small effect or low data rate) or simple (linear-ish or more-is-better-like).
But the self-models of social animals will quickly grow complex because the prediction of the action on the environment includes elements in the environment - other members of the species - that themselves predict the actions of other members.
You don't mention it, but I think Theory of Mind or Empathic Inference plays a large role in the specific flavor of human self-models.