Biden-Harris Administration Announces First-Ever Consortium Dedicated to AI Safety 2024-02-09T06:40:44.427Z
The intelligence-sentience orthogonality thesis 2023-07-13T06:55:59.353Z
Nature: "Stop talking about tomorrow’s AI doomsday when AI poses risks today" 2023-06-28T05:59:49.015Z
Who Aligns the Alignment Researchers? 2023-03-05T23:22:27.107Z
Grant-making in EA should consider peer-reviewing grant applications along the public-sector model 2023-01-24T15:01:22.110Z
Sets of objectives for a multi-objective RL agent to optimize 2022-11-23T06:49:45.236Z
AMC's animated series "Pantheon" is relevant to our interests 2022-10-10T05:59:23.925Z
That-time-of-year Astral Codex Ten Meetup 2022-08-17T00:02:14.600Z
Can we achieve AGI Alignment by balancing multiple human objectives? 2022-07-03T02:51:34.279Z
A brief review of the reasons multi-objective RL could be important in AI Safety Research 2021-09-29T17:09:56.728Z
Signaling Virtuous Victimhood as Indicators of Dark Triad Personalities 2021-08-26T19:18:22.366Z


Comment by Ben Smith (ben-smith) on The Aspiring Rationalist Congregation · 2024-01-12T03:32:53.717Z · LW · GW

For me the issue is that

  1. it isn't clear how you could enforce attendance, and

  2. it isn't clear what value individual attendees would get that makes it worth their while to attend regularly.

(2) is sort of a collective action/game theoretic/coordination problem.

(1) reflects the rationalist nature of the organization.

Traditional religions back up attendance by divine command. They teach absolutist, divine-command-theoretic accounts of morality, backed up by accounts of commands from God to attend regularly. In the most severe form, these are backed by the threat of eternal hellfire for disobedience. But it doesn't usually come to that: the moralization of the attendance norm is strong enough to justify moderate amounts of social pressure to conform to it, and often that's enough.

In a rationalist congregation, if you want a regular attendance norm, you have to ground it in a rational understanding that adhering to the norm makes the organization work. I think that might work, but it's probably a lot harder, because it takes many more cognitive steps to get there and it only works so long as attendees buy into the goal of contributing to the project for its own sake.

Comment by Ben Smith (ben-smith) on Sentience, Sapience, Consciousness & Self-Awareness: Defining Complex Terms · 2023-12-17T00:59:38.778Z · LW · GW

I tried a similar Venn diagram approach more recently. I didn't really distinguish between bare "consciousness" and "sentience". I'm still not sure I agree "aware without thoughts and feelings" is meaningful; I think awareness might always be awareness of something. But nevertheless they are at least distinct concepts, and they can be conceptually separated! Otherwise my model echoes the one you created earlier.

I think it's a really interesting question as to whether you can have sentience and sapience but not self-awareness. I wouldn't take a view either way. I sort of speculated that perhaps primitive animals like shrimp might fit into that category.

Comment by Ben Smith (ben-smith) on Book Review: Going Infinite · 2023-11-11T15:45:18.163Z · LW · GW

If Ray eventually found that the money was "still there", doesn't this make Sam right that "the money was really all there, or close to it" and "if he hadn’t declared bankruptcy it would all have worked out"?

Ray kept searching, Ray kept finding.

That would raise the amount collected to $9.3 billion—even before anyone asked CZ for the $2.275 billion he’d taken out of FTX. Ray was inching toward an answer to the question I’d been asking from the day of the collapse: Where did all that money go? The answer was: nowhere. It was still there.

Comment by Ben Smith (ben-smith) on Cohabitive Games so Far · 2023-10-18T06:25:49.091Z · LW · GW

What a great read. Best of luck with this project. It sounds compelling.

Comment by Ben Smith (ben-smith) on Petrov Day Retrospective, 2023 (re: the most important virtue of Petrov Day & unilaterally promoting it) · 2023-09-30T20:06:22.126Z · LW · GW

Seems to me that in this case, the two are connected. If I falsely believed my group was in the minority, I might refrain from clicking the button out of a sense of fairness or deference to the majority group. 

Consequently, the lie not only influenced people who clicked the button; it perhaps also influenced people who did not. Because the second survey was based on a false premise, it should be disregarded altogether. Not disregarding it would mean obtaining, by fraud or trickery, a result that disadvantages all the majority-group members who chose not to click because they falsely believed their view was in the minority.

I think, morally speaking, avoiding disadvantaging participants through fraud is more important than honoring your word to their competitors.

The key difference between this and the example is that there's a connection between the lie and the promise.

Comment by Ben Smith (ben-smith) on The intelligence-sentience orthogonality thesis · 2023-07-24T05:52:26.292Z · LW · GW

Differentiating intelligence and agency seems hugely clarifying for many discussions in alignment.

You might have noticed I didn't actually fully differentiate intelligence and agency. It seems to me that to exert agency, a mind needs a certain amount of intelligence, and so I think all agents are intelligent, though not all intelligences are agentic. Minimally intelligent agents (like simple RL agents in simple computer models) are also pretty minimally agentic. I'd be curious to hear about a counterexample.

Incidentally I also like Anil Seth's work and I liked his recent book on consciousness, apart from the bit about AGI. I read it right along with Damasio's latest book on consciousness and they paired pretty well. Seth is a bit more concrete and detail oriented and I appreciated that.

It would make it much easier to understand ideas in this area if writers used more conceptual clarity, particularly empirical consciousness researchers (philosophers can be a bit better, I think, and I say that as an empirical researcher myself). When I read that quote from Seth, it seems clear he was arguing AGI is unlikely to be an existential threat because it's unlikely to be conscious. Does he naively conflate consciousness with agency, because he's not an artificial agency researcher and hasn't thought much about it? Or does he have a sophisticated point of view about how agency and consciousness really are linked, based on his couple of decades of consciousness research? That seems very unlikely, given how much we know about artificial agents, but the only way to be sure is to ask him.

Similarly, MANY people, including empirical researchers and maybe philosophers, treat consciousness and self-awareness as somewhat synonymous, or at least interdependent. Is that because they're being naive about the link, or because, as outlined in Clark, Friston, & Wilkinson's Bayesing Qualia, they have sophisticated theories based on evidence that there really are tight links between the two? When writing this post I was pretty sure consciousness and self-awareness were "orthogonal"/independent, and now, following other discussion in the comments here and on Facebook, I'm less clear about that. But I'd like more people to do what Friston did when he explained exactly why he thinks consciousness arises from self-awareness/meta-cognition.

Comment by Ben Smith (ben-smith) on The intelligence-sentience orthogonality thesis · 2023-07-20T05:51:37.575Z · LW · GW

I found the Clark et al. (2019) "Bayesing Qualia" article very useful, and that did give me an intuition of the account that perhaps sentience arises out of self-awareness. But they themselves acknowledged in their conclusion that the paper didn't quite demonstrate that principle, and I didn't find myself convinced of it.

Perhaps what I'd like readers to take away is that sentience and self-awareness can at the very least be conceptually distinguished. Even if it isn't clear empirically whether or not they are intrinsically linked, we ought to maintain a conceptual distinction in order to form testable hypotheses about whether they are in fact linked, and in order to reason about the nature of any link. Perhaps I should call that "theoretical orthogonality". This is important for reasoning about whether, for instance, giving our AIs self-awareness or situational awareness will cause them to be sentient. I do not think that will be the case, although I do think that if you gave them the sort of detailed self-monitoring feelings that humans have, that may yield sentience itself. But it's not clear!

I listened to the whole episode with Bach as a result of your recommendation! Bach hardly even got a chance to express his ideas, and I'm not much closer to understanding his account of 

meta-awareness (i.e., awareness of awareness) within the model of oneself which acts as a 'first-person character' in the movie/dream/"controlled hallucination" that the human brain constantly generates for oneself is the key thing that also compels the brain to attach qualia (experiences) to the model. In other words, the "character within the movie" thinks that it feels something because it has meta-awareness (i.e., the character is aware that it is aware (which reflects the actual meta-cognition in the brain, rather than in the brain, insofar the character is a faithful model of reality).

which seems like a crux here. 

He sort of briefly described "consciousness as a dream state" at the very end, but although I did get the sense that maybe he thinks meta-awareness and sentience are connected, I didn't really hear a great argument for that point of view.

He spent several minutes arguing that agency, or seeking a utility function, is something humans have, but that these things aren't sufficient for consciousness (I don't remember whether he said whether they were necessary, so I suppose we don't know if he thinks they're orthogonal).

Comment by Ben Smith (ben-smith) on The intelligence-sentience orthogonality thesis · 2023-07-20T04:51:26.157Z · LW · GW

I wanted to write myself about a popular confusion between decision making, consciousness, and intelligence which among other things leads to bad AI alignment takes and mediocre philosophy.

This post has not got a lot of attention, so if you write your own post, perhaps the topic will have another shot at reaching popular consciousness (heh), and if you succeed, I might try to learn something about how you did it and this post did not!

I wasn't thinking that it's possible to separate qualia perception and self awareness

Separating qualia and self-awareness is a controversial assertion and it seems to me people have some strong contradictory intuitions about it!

I don't think that, in the experience of perceiving red, there is necessarily any conscious awareness of oneself--in that moment there is just the qualia of redness. I can imagine two possible objections: (a) perhaps there is some kind of implicit awareness of self in that moment that enables the conscious awareness of red, or (b) perhaps it's only possible to have that experience of red within a perceptual framework where one has perceived oneself. But personally I don't find either of those accounts persuasive.

I think flow states are also moments where one's awareness can be so focused on the activity one is engaged in that one momentarily loses any awareness of one's own self.

there is no intersection between sentience and intelligence that is not self-awarness. 

I should have defined intelligence in the post--perhaps I'll edit it. The only concrete and clear definition of intelligence I'm aware of is psychology's g factor, which is something like the ability to recognize patterns and draw inferences from them. That is what I mean--no more than that.

A mind that is sentient and intelligent but not self aware might look like this: when a computer programmer is deep in the flow state of bringing a function in their head into code on the screen, they may experience moments of time where they have sentient awareness of their work, and certainly are using intelligence to transform their ideas into code, but do not in those particular moments have any awareness of self.

Comment by Ben Smith (ben-smith) on The intelligence-sentience orthogonality thesis · 2023-07-14T04:47:01.032Z · LW · GW

Thank you for the link to the Friston paper. I'm reading that and will watch Lex Fridman's interview with Joscha Bach, too. I sort of think "illusionism" is a bit too strong, but perhaps it's a misnomer rather than wrong (or I could be wrong altogether). Clark, Friston, and Wilkinson say

But in what follows we aim not to Quine (explain away) qualia but to ‘Bayes’ them – to reveal them as products of a broadly speaking rational process of inference, of the kind imagined by the Reverend Bayes in his (1763) treatise on how to form and update beliefs on the basis of new evidence. Our story thus aims to occupy the somewhat elusive ‘revisionary’ space, in between full strength ‘illusionism’ (see below) and out-and-out realism

and I think somewhere in the middle sounds more plausible to me.

Anyhow, I'll read the paper first before I try to respond more substantively to your remarks, but I intend to!

Comment by Ben Smith (ben-smith) on Your Dog is Even Smarter Than You Think · 2023-07-11T15:55:54.074Z · LW · GW

Great post. Two points of disagreement worth mentioning:

  1. Exploring the full ability of dogs and cats to communicate isn't so much impractical in academia as it is theoretically uninteresting. We know animals can do operant conditioning (we've probably known for over 100 years), but we also know they struggle with complex syntax. I guess there's a lot of uncertainty in the middle, so I'm low confidence about this. But generally, to publish a high-impact paper about dog or cat communication you'd have to show they can do more than "conditioning": that they understand syntax in some way. That's probably pretty hard; maybe you can do it, but do you want to stake your career on it?
  2. That brings me to my second point: is it more than operant conditioning? Some of the videos show the animals pressing multiple buttons. But Billy the Cat's videos show his trainer teaching him button sequences. I'm not a language expert, but to demonstrate syntax understanding, you have to do more than show he can learn sequences of button presses he was taught verbatim. At a minimum there'd need to be evidence he can form novel sentences by combining buttons in apparently intentional ways that could only be put together by generalizing from some syntax rules. Maaaybe @Adele Lopez's observation that Bunny seems to reverse her owner's word order might be appropriate evidence. But if she's been reinforced for her own arbitrarily chosen word order in the past, she might develop it without really appreciating rules of syntax per se. In fact, a hallmark of learning language is that you can learn syntax correctly.

Comment by Ben Smith (ben-smith) on Blanchard's Dangerous Idea and the Plight of the Lucid Crossdreamer · 2023-07-11T13:24:49.765Z · LW · GW

There's not just acceptance at stake here. Medical insurance companies are not typically going to buy into a responsibility to support clients' morphological freedom, as if medically transitioning were in the same class of thing as a cis person getting a facelift or a cis woman getting a boob job, which are near-universally understood as "elective" medical procedures. But if their clients have a "condition" that requires "treatment", well, now insurers are on the hook to pay.

A lot of mental health treatment works the same way imho--people have various psychological states, many of which get inappropriately shoehorned into a pathology or illness narrative in order to get the insurance companies to pay.

All this adds a political dimension to the not inconsiderable politics of social acceptance.

Comment by Ben Smith (ben-smith) on The easy goal inference problem is still hard · 2023-07-03T05:48:48.383Z · LW · GW

I guess this falls into the category of "Well, we’ll deal with that problem when it comes up", but I'd imagine that when a human preference in a particular dilemma is undefined or even just highly uncertain, one can often defer to other rules. For example, rather than maximizing an uncertain preference, default to maximizing the human's agency in scenarios where preference is unclear, even if this predictably leads to less-than-optimal preference satisfaction.
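A minimal Python sketch of that fallback rule, assuming each candidate action comes with an estimated preference score, an uncertainty on that estimate, and some proxy for how much agency it leaves the human. The fields, the `max_std` threshold, and the "options left" proxy are all hypothetical illustrations, not an existing method.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    pref_mean: float   # estimated human preference for this action's outcome (hypothetical scale)
    pref_std: float    # uncertainty about that estimate
    options_left: int  # hypothetical proxy for how much agency the human retains


def choose(actions: list[Action], max_std: float = 1.0) -> Action:
    """Pick the preference-maximizing action when the preference is clear enough;
    otherwise fall back to the action that preserves the most human agency."""
    best = max(actions, key=lambda a: a.pref_mean)
    if best.pref_std <= max_std:
        return best
    # The apparently best action's preference is too uncertain: accept possibly
    # lower preference satisfaction in exchange for keeping the human's options open.
    return max(actions, key=lambda a: a.options_left)


uncertain = [Action("lock in plan", 5.0, 3.0, 1), Action("ask and wait", 2.0, 0.5, 4)]
certain = [Action("lock in plan", 5.0, 0.5, 1), Action("ask and wait", 2.0, 0.5, 4)]
print(choose(uncertain).name)  # ask and wait
print(choose(certain).name)    # lock in plan
```

The same high-scoring plan is chosen only when the system is confident about the preference behind it; otherwise the agency-preserving option wins.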

Comment by Ben Smith (ben-smith) on Nature: "Stop talking about tomorrow’s AI doomsday when AI poses risks today" · 2023-06-29T05:47:23.167Z · LW · GW

I think your point is interesting and I agree with it, but I don't think Nature are only addressing the general public. To me, it seems like they're addressing researchers and policymakers and telling them what they ought to focus on as well.

Comment by Ben Smith (ben-smith) on Simpler explanations of AGI risk · 2023-05-21T00:58:39.286Z · LW · GW

Well written, I really enjoyed this. This is not really on topic, but I'd be curious to read an "idiot's guide" or maybe an "autist's guide" on how to avoid sounding condescending.

Comment by Ben Smith (ben-smith) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-21T16:03:28.223Z · LW · GW

interpretability on pretrained model representations suggest they're already internally "ensembling" many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstraction

That seems encouraging to me. There's a model of AGI value alignment where the system has a particular goal it wants to achieve and brings all its capabilities to bear on achieving that goal. It does this by having a coherent "world model" and perhaps a set of consistent Bayesian priors about how the world works. I can understand why such a system would tend to behave in a hyperfocused way to achieve its goals.

In contrast, a system with an ensemble of abstractions about the world, many of which may even be inconsistent, seems much more humanlike. It seems more humanlike specifically in that the system won't be focused on a particular goal, or even a particular perspective about how to achieve it, but could arrive at a particular solution more or less randomly, based on quirks of training data.

I wonder if there's something analogous to human personality, where being open to experience or even open to some degree of contradiction (in a context where humans are generally motivated to minimize cognitive dissonance) is useful for seeing the world in different ways and trying out strategies and changing tack, until success can be found. If this process applies to selecting goals, or at least sub-goals, which it certainly does in humans, you get a system which is maybe capable of reflecting on a wide set of consequences and choosing a course of action that is more balanced, and hopefully balanced amongst the goals we give a system.

Comment by Ben Smith (ben-smith) on Clippy, the friendly paperclipper · 2023-03-02T05:00:44.215Z · LW · GW

I've been writing about multi-objective RL and trying to figure out a way that an RL agent could optimize for a non-linear sum of objectives in a way that avoids strongly negative outcomes on any particular objective.
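One simple way to get that behavior, sketched in Python under the assumption that every objective is already on a comparable scale: pass each objective through a concave transform before summing, so a large loss on any single objective costs more than an equal gain elsewhere is worth. The exponential transform and the scale `k` are illustrative choices, not necessarily the ones used in my multi-objective RL posts.

```python
import math


def linear_score(objectives: list[float]) -> float:
    # Plain sum: a big gain on one objective can buy a big loss on another.
    return sum(objectives)


def concave_score(objectives: list[float], k: float = 1.0) -> float:
    # Sum of -exp(-k * x): the penalty grows exponentially as any single
    # objective goes negative, so lopsided trade-offs are discouraged.
    return sum(-math.exp(-k * x) for x in objectives)


balanced = [1.0, 1.0]
lopsided = [4.0, -1.5]  # higher plain total, but one objective does badly

print(linear_score(balanced), linear_score(lopsided))  # 2.0 2.5
print(concave_score(balanced) > concave_score(lopsided))  # True
```

The plain sum prefers the lopsided outcome; the concave sum prefers the balanced one.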

Comment by Ben Smith (ben-smith) on Interpersonal alignment intuitions · 2023-02-24T06:43:09.608Z · LW · GW

This sounds like a very interesting question.

I get stuck trying to answer your question itself on the differences between AGI and humans.

But taking your question itself at its face:

ferreting out the fundamental intentions

What sort of context are you imagining? Humans aren't even great at identifying the fundamental reason for their own actions. They'll confabulate if forced to.

Comment by Ben Smith (ben-smith) on Please don't throw your mind away · 2023-02-23T07:37:16.142Z · LW · GW

thank you for writing this. I really personally appreciate it!

Comment by Ben Smith (ben-smith) on Quick notes on “mirror neurons” · 2022-10-10T00:32:50.775Z · LW · GW

That's smart! When I started graduate school in psychology in 2013, mirror neurons felt like, colloquially, "hot shit", but within a few years, people had started to cringe quite dramatically whenever the phrase was used. I think your reasoning in (3) is spot on.

Your example leads to fun questions like "how do I recognize juggling?", including "what stimuli activate the concept of juggling when I do it?" vs. "what stimuli activate the concept of juggling when I see you do it?". Intuitively, nothing there seems to require that those be the same neurons, except the concept of juggling itself.

Empirically I would probably expect to see a substantial overlap in motor and/or somatosensory areas. One could imagine the activation pathway there is something like

visual cortex [see juggling] -> temporal cortex [concept of juggling] -> motor cortex [intuitions of moving arms]

And we'd also expect to see some kind of direct "I see you move your arm in x formation"->"I activate my own processes related to moving my arm in x formation" that bypasses the temporal cortex altogether.

And we could probably come up with more pathways that all cumulatively produce "mirror neural activity" which activates both when I see you do a thing and when I do that same thing. Maybe that's a better concept/name?

Comment by Ben Smith (ben-smith) on Open Problems in Negative Side Effect Minimization · 2022-10-10T00:24:27.837Z · LW · GW

Then the next thing I want to suggest is that the system uses human resolution of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if it is higher than a suitable level of confidence, it will go ahead and act without human intervention. But any prediction of what a human would predict could be second-guessed by a human pointing out where the prediction is wrong.
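A toy Python sketch of that loop, assuming conflicts can be labelled by type. The `ConflictResolver` class, its counting "model", and the default thresholds are hypothetical stand-ins for a real learned predictor; the point is only the control flow: act autonomously above a confidence threshold, defer to a human below it, and treat every human resolution (including a correction of the system's guess) as new training data.

```python
from collections import Counter, defaultdict


class ConflictResolver:
    """Hypothetical stand-in for a learned model of how a human resolves
    conflicting outcomes. It simply counts past human decisions per conflict type."""

    def __init__(self, threshold: float = 0.9, min_examples: int = 5):
        self.threshold = threshold
        self.min_examples = min_examples
        self.history = defaultdict(Counter)  # conflict type -> Counter of human choices

    def record(self, conflict: str, human_choice: str) -> None:
        # Every human resolution, including overrides of a wrong prediction, is training data.
        self.history[conflict][human_choice] += 1

    def decide(self, conflict: str):
        counts = self.history[conflict]
        total = sum(counts.values())
        if total < self.min_examples:
            return None  # too little data: defer to a human
        choice, n = counts.most_common(1)[0]
        if n / total >= self.threshold:
            return choice  # confident enough to act without human intervention
        return None  # uncertain: defer to a human


resolver = ConflictResolver()
for _ in range(19):
    resolver.record("speed_vs_safety", "safety")
resolver.record("speed_vs_safety", "speed")
print(resolver.decide("speed_vs_safety"))  # safety
print(resolver.decide("cost_vs_privacy"))  # None
```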

Agreed that whether a human understands the plan (and all the relevant outcomes; and which outcomes are relevant?) is important, and harder than I first imagined.

Comment by Ben Smith (ben-smith) on Why I think there's a one-in-six chance of an imminent global nuclear war · 2022-10-08T15:40:45.704Z · LW · GW

You haven't factored in the possibility that Putin gets deposed by forces inside Russia who are worried about a nuclear war. Conditional on the use of tactical nukes, that intuitively seems likely enough to materially lower p(kaboom).

Comment by Ben Smith (ben-smith) on Covid 9/1/22: Meet the New Booster · 2022-09-29T19:04:31.629Z · LW · GW

American Academy of Pediatrics lies to us once again....

"If caregivers are wearing masks, does that harm kids’ language development? No. There is no evidence of this. And we know even visually impaired children develop speech and language at the same rate as their peers."
This is a textbook case of the Law of No Evidence. Or it would be, if there wasn’t any Proper Scientific Evidence.

Is it, though? I'm no expert, but I tried to find Relevant Literature. Sometimes, counterintuitive things are true. 

Blindness affects congenitally blind children’s development in different ways, language development being one of the areas less affected by the lack of vision.

Most researchers have agreed upon the fact that blind children’s morphological development, with the exception of personal and possessive pronouns, is not delayed nor impaired in comparison to that of sighted children, although it is different.
As for syntactic development, comparisons of MLU scores throughout development indicate that blind children are not delayed when compared to sighted children
Blind children use language with similar functions, and learn to perform these functions at the same age as sighted children. Nevertheless, some differences exist up until 4;6 years; these are connected to the adaptive strategies that blind children put into practice, and/or to their limited access to information about external reality. However these differences disappear with time (Pérez-Pereira & Castro, 1997). The main early difference is that blind children tend to use self-oriented language instead of externally oriented language.

I don't know exactly where that leaves us evidentially. Perhaps the AAP is lying by omission by not telling us about things other than language that are affected by children's sight.

That's a bit different to the dishonesty alleged, though.

Comment by Ben Smith (ben-smith) on [Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts · 2022-09-08T06:48:20.412Z · LW · GW

Still working my way through reading this series--it is the best thing I have read in quite a while and I'm very grateful you wrote it!

I feel like I agree with your take on "little glimpses of empathy" 100%.

I think fear of strangers could be implemented without a steering subsystem circuit maybe? (Should say up front I don't know more about developmental psychology/neuroscience than you do, but here's my 2c anyway). Put aside whether there's another more basic steering subsystem circuit for agency detection; we know that pretty early on, through some combination of instinct and learning from scratch, young humans and many animals learn there are agents in the world who move in ways that don't conform to the simple rules of physics they are learning. These agents seem to have internally driven and unpredictable behavior, in the sense their movement can't be predicted by simple rules like "objects tend to move to the ground unless something stops them" or "objects continue to maintain their momentum". It seems like a young human could learn an awful lot of that from scratch, and even develop (in their thought generator) a concept of an agent. 

Because of their unpredictability, agent concepts in the thought generator would be linked to thought assessor systems related to both reward and fear; not necessarily from prior learning derived from specific rewarding and fearful experiences, but simply because, as their behavior can't be predicted with intuitive physics, there remains a very wide prior on what will happen when an agent is present.

In that sense, when a neocortex is first formed, most things in the world are unpredictable to it, and an optimally tuned thought generator+assessor would keep circuits active for both reward and harm. Over time, as the thought generator learns folk physics, most physical objects can be predicted, and it typically generates thoughts in line with their actual behavior. But agents are a real wildcard: their behavior can't be predicted by folk physics, and so they are perceived the way every other object in the world used to be: unpredictable, and thus continually predicting both reward and harm in an opponent process that leads to an ambivalent and uneasy neutral. This story predicts that individual differences in reward and threat sensitivity would particularly govern the default reward/threat balance for otherwise unknown items. It might (I'm really REALLY reaching here) help to explain why attachment styles seem so fundamentally tied to basic reward and threat sensitivity.
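To make the opponent-process idea concrete, here is a Python sketch under some loud assumptions: the thought generator's prediction of what an object will do is summarized as a Gaussian over outcome valence, the reward signal scales with the expected positive part of that distribution, the threat signal with the expected negative part, and the hypothetical `reward_sens`/`threat_sens` parameters stand in for individual differences. None of this is a claim about the actual neural implementation.

```python
import math


def expected_gain_and_loss(mu: float, sigma: float) -> tuple[float, float]:
    """For an outcome X ~ Normal(mu, sigma), return (E[max(X, 0)], E[max(-X, 0)])."""
    if sigma == 0:
        return max(mu, 0.0), max(-mu, 0.0)
    z = mu / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    gain = mu * cdf + sigma * pdf
    loss = gain - mu  # because E[max(X,0)] - E[max(-X,0)] = E[X] = mu
    return gain, loss


def appraisal(mu: float, sigma: float, reward_sens: float = 1.0, threat_sens: float = 1.0) -> dict:
    gain, loss = expected_gain_and_loss(mu, sigma)
    reward, threat = reward_sens * gain, threat_sens * loss
    return {"reward": reward, "threat": threat, "net": reward - threat}


rock = appraisal(mu=0.0, sigma=0.0)      # well predicted by folk physics
stranger = appraisal(mu=0.0, sigma=2.0)  # unpredictable agent: wide prior
anxious = appraisal(mu=0.0, sigma=2.0, reward_sens=1.0, threat_sens=1.5)
```

For the rock, both signals are zero. For the stranger, both are about 0.80 and cancel into an uneasy neutral; raising threat sensitivity tips the same wide prior into a net negative appraisal.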

As the thought generator forms more concepts about agents, it might even learn that agents can be classified with remarkable predictive power into "friend" or "foe" categories, or perhaps "mommy/carer" and "predator" categories. As a consequence of how rocks behave (with complete indifference towards small children), it's not so easy to predict behavior of, say, falling rocks with "friend" or "foe" categories. On the contrary, agents around a child are often not indifferent to children, making it simple for the child to predict whether favorable things will happen around any particular agent by classifying agents into "carer" or "predator" categories. These categories can be entirely learned; clusters of neurons in the thought generator that connect to reward and threat systems in the steering system and/or thought assessor. So then the primary task of learning to predict agents is simply whether good things or bad things happen around the agent, as judged by the steering system.

This story would also predict that, before the predictive power of categorizing agents into "friend" vs. "foe" categories has been learned, children wouldn't know to place agents into these categories. They'd take longer to learn whether an agent is trustworthy or not, particularly so if they haven't learned what an agent is yet. As they grow older, they get more comfortable with classifying agents into "friend" or "foe" categories and would need fewer exemplars to learn to trust (or distrust!) a particular agent.
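A Beta-Bernoulli sketch in Python of that last prediction, assuming the child tracks the probability that good things happen around a given agent. A child without agent categories starts from a flat prior; a child who has learned the "carer"/"predator" split starts from a two-component mixture prior. The prior parameters and the 0.85 trust threshold are made up for illustration.

```python
def posterior_mean(components: list[tuple[float, float, float]], n_good: int) -> float:
    """Mean P(good outcome) after n_good good observations.
    components: list of (weight, a, b) Beta prior components."""
    comps = list(components)
    for _ in range(n_good):
        # Reweight each component by its predictive probability of a good outcome,
        # then update that component's Beta parameters.
        comps = [(w * a / (a + b), a + 1, b) for (w, a, b) in comps]
        total = sum(w for w, _, _ in comps)
        comps = [(w / total, a, b) for (w, a, b) in comps]
    return sum(w * a / (a + b) for (w, a, b) in comps)


def exemplars_to_trust(components, threshold: float = 0.85, max_n: int = 100):
    for n in range(max_n + 1):
        if posterior_mean(components, n) >= threshold:
            return n
    return None


NAIVE = [(1.0, 1, 1)]                     # no notion of agent categories yet
CATEGORIZED = [(0.5, 9, 1), (0.5, 1, 9)]  # learned "carer" vs "predator" clusters
print(exemplars_to_trust(NAIVE), exemplars_to_trust(CATEGORIZED))  # 5 2
```

With these invented numbers, the flat prior needs five good experiences before trust crosses the threshold, while the categorized prior needs two, because the first observations mostly serve to decide which category the agent belongs to.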

Comment by Ben Smith (ben-smith) on That-time-of-year Astral Codex Ten Meetup · 2022-09-01T01:02:49.595Z · LW · GW

Event is on tonight as planned at 7. If you're coming, looking forward to seeing you!

Comment by Ben Smith (ben-smith) on Inner Alignment in Salt-Starved Rats · 2022-08-29T05:27:21.645Z · LW · GW

I wrote a paper on another experiment by Berridge, reported in Zhang & Berridge (2009). Similar behavior was observed in that experiment, but the question explored was a bit different. They reported a behavioral pattern in which rats typically found moderately salty solutions appetitive and very salty solutions aversive. When put into salt deprivation, rats then found both solutions appetitive, but the very salty solution less so.

They (and we) took it as given that homeostatic regulation set a 'present value' for salt that was dependent on the organism's current state. However, in that model, you would think rats would most prefer the extremely salty solution. But in any state, they prefer the moderately salty solution. 

In a CABN paper, we pointed out this is not explainable when salt value is determined by a single homeostatic signal, but is explainable when neuroscience about the multiple salt-related homeostatic signals is taken into account. Some fairly recent neuroscience by Oka & Lee (and some older stuff too!) is very clear about the multiple sets of pathways involved. Because there are multiple regulatory systems for salt balance, the present value of these can be summed (as in your "multi-dimensional rewards" post) to get a single value signal that tracks the motivation level of the rat for the rewards involved.
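A toy Python version of that argument. The concentrations, gains, and threshold are invented for illustration; they are not fit to Zhang & Berridge's data, and the two terms are generic placeholders rather than the specific pathways Oka & Lee describe or the model in our CABN paper. The comparison shows why a single deprivation-scaled signal always prefers the saltier solution, while summing an appetitive signal that saturates and an aversive signal that kicks in at high concentration reproduces the observed pattern.

```python
def salt_value(concentration: float, deprived: bool) -> float:
    """Hypothetical two-pathway value: an appetitive term that saturates at moderate
    concentration and is amplified by deprivation, minus an aversive term that
    grows once concentration exceeds a threshold."""
    appetitive_gain = 4.0 if deprived else 1.0
    appetitive = appetitive_gain * min(concentration, 1.0)
    aversive = 1.5 * max(0.0, concentration - 1.0)
    return appetitive - aversive


def single_signal_value(concentration: float, deprived: bool) -> float:
    """One homeostatic signal that scales with salt content: always prefers more salt."""
    return (4.0 if deprived else 1.0) * concentration


MODERATE, HIGH = 0.5, 3.0  # arbitrary concentration units
for deprived in (False, True):
    print(deprived, salt_value(MODERATE, deprived), salt_value(HIGH, deprived))
# False 0.5 -2.0   (moderate appetitive, very salty aversive)
# True 2.0 1.0     (both appetitive, very salty less so)
```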

Comment by Ben Smith (ben-smith) on [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering · 2022-08-29T00:37:37.635Z · LW · GW

Hey Steve, I am reading through this series now and am really enjoying it! Your work is incredibly original and wide-ranging as far as I can see--it's impressive how many different topics you have synthesized.

I have one question on this post--maybe doesn't rise above the level of 'nitpick', I'm not sure. You mention a "curiosity drive" and other Category A things that the "Steering Subsystem needs to do in order to get general intelligence". You've also identified the human Steering Subsystem as the hypothalamus and brain stem.

Is it possible that things like a "curiosity drive" arise from, say, the way the telencephalon is organized, rather than from the Steering Subsystem itself? To put it another way, if the curiosity drive is mainly implemented as motivation to reduce prediction error, or fill the neocortex, how confident are you in identifying this process with the hypothalamus+brain stem?

I think the way I'd buy the argument is something like "the steering system ultimately provides all rewards, and that would include reward from prediction error". But then I wonder whether you're implying some greater role for the hypothalamus+brain stem or not.

Comment by Ben Smith (ben-smith) on Multi-dimensional rewards for AGI interpretability and control · 2022-08-28T22:00:14.807Z · LW · GW

Very late to the party here. I don't know how much of the thinking in this post you still endorse or are still interested in. But this was a nice read. I wanted to add a few things:

 - Since you wrote this piece back in 2021, I have learned there is a whole mini-field of computer science dealing with multi-objective reward learning, maybe centered around . Maybe a good place to start there is

 - The shard theory folks have done a fairly good job sketching out broad principles, but it seems to me that homeostatic regulation does a great job of modulating which values happen to be relevant at any one time. Xavier Roberts-Gaal recently recommended "Where do values come from?" to me, and that paper sketches out a fairly specific theory for how this happens (I think it might be that more homeostatic recalculation happens physiologically rather than neurologically, but I otherwise buy what they are saying).

 - I continue to think the vmPFC is relevant because different parts of it are known to calculate the value of different aspects of stimuli, and this can be modulated by state from time to time. A recent paper on this by Luke Chang & colleagues is a neural signature of reward.

Comment by Ben Smith (ben-smith) on Shard Theory: An Overview · 2022-08-12T06:01:43.544Z · LW · GW

At this moment in time I have two theories about how shards seem to be able to form consistent and competitive values that don't always optimize for some ultimate goal:

  • Overall, shard theory is developed to describe the behavior of human agents whose inputs and outputs are multi-faceted. I think something about this structure might facilitate the development of shards in many different directions. This seems different from modern deep RL agents: although they can also potentially have lots of input and output nodes, these are pretty finely honed to achieve a fairly narrow goal, so in a sense it is not too much of a surprise that they sometimes seem to Goodhart on the goals they are given. In contrast, there’s no single terminal value or single primary reinforcer in the human RL system: sugary foods score reward points, but so do salty foods when the brain’s subfornical region indicates there’s not enough sodium in the bloodstream (Oka, Ye, Zuker, 2015); water consumption also gets reward points when there’s not enough water. So you have parallel sets of reinforcement developing from a wide set of primary reinforcers all at the same time.
  • As far as I know, a typical deep RL agent is structured hierarchically, with feedforward connections from inputs at one end to outputs at the other, and connections throughout the system reinforced with backpropagation. The brain doesn't use backpropagation (though maybe it has similar or analogous processes); it seems to "reward" successful (in terms of prediction error reduction, or temporal/spatial association, or simply firing at the same time...?) connections throughout the neocortex, without those connections necessarily having to propagate backwards from some primary reinforcer.
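As a contrast with backpropagation, here is a minimal sketch of a purely local, reward-modulated ("three-factor") Hebbian update in Python. It is a generic illustrative rule, not a claim about what the brain actually does; the function name and learning rate are hypothetical. Each weight changes using only its own pre- and post-synaptic activity plus one broadcast scalar (which could stand in for reward or for reduced prediction error), so no error signal is propagated backwards from a primary reinforcer.

```python
from typing import List

Matrix = List[List[float]]


def three_factor_update(
    weights: Matrix, pre: List[float], post: List[float], signal: float, lr: float = 0.1
) -> Matrix:
    """Return updated weights using only locally available quantities.

    weights[i][j] connects presynaptic unit j to postsynaptic unit i.
    Update: dw[i][j] = lr * signal * post[i] * pre[j]
    `signal` is one broadcast scalar (e.g. a stand-in for reward or for
    reduced prediction error); no error is propagated back through layers.
    """
    return [
        [w + lr * signal * post_i * pre_j for w, pre_j in zip(row, pre)]
        for row, post_i in zip(weights, post)
    ]


weights = [[0.0, 0.0], [0.0, 0.0]]  # 2 postsynaptic x 2 presynaptic units
pre = [1.0, 0.0]   # only presynaptic unit 0 fired
post = [1.0, 0.5]  # both postsynaptic units were active to some degree

rewarded = three_factor_update(weights, pre, post, signal=1.0)
unrewarded = three_factor_update(weights, pre, post, signal=0.0)
print(rewarded)    # connections from the active presynaptic unit are strengthened
print(unrewarded)  # with no broadcast signal, nothing changes
```

Only co-active connections change, and only when the broadcast signal is nonzero; many such rules could run in parallel, each gated by a different primary reinforcer.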

The point about being better at credit assignment as you get older is probably not too much of a concern. It’s very high-level, and to the extent it is true, it is mostly attributable to a more sophisticated world model. If you put a 40-year-old and an 18-year-old into a credit assignment game in a novel computer game environment, I doubt the 40-year-old will do better. They might beat a 10-year-old, but only to the extent that the 40-year-old has learned very abstract facts about associations between objects which they can apply to the game. Speed it up so that they can’t use System 2 processing, and the 10-year-old will probably beat them.

Comment by Ben Smith (ben-smith) on AGI-level reasoner will appear sooner than an agent; what the humanity will do with this reasoner is critical · 2022-07-31T15:36:42.182Z · LW · GW

I have pointed this out to folks in the context of AI timelines: Metaculus gives predictions for "weakly general AI", but I consider a hypothetical GATO-x that can generalize to a task outside its training distribution, or to many tasks outside its training distribution, to be AGI, yet a considerable way from an AGI with enough agency to act on its own.

OTOH, it isn't much reassurance if bootstrapping this thing up to agency takes as little as a batch script to keep it running.

But the time between weak AGI and agentic AGI is a prime learning opportunity and the lesson is we should do everything we can to prolong the length of the time between them once weak AGI is invented.

Also, perhaps someone should study the necessary components for an AGI takeover by simulating agent behavior in a toy model. At the least you need a degree of agency, probably a self-model in order to recursively self-improve, and the ability to generalize. Knowing what the necessary components are might enable us to take steps to avoid having them in one system all at once.

If anyone has ever demonstrated, or even systematically described, what those necessary components are, I haven't seen it done. Maybe it is an infohazard but it also seems like necessary information to coordinate around.

Comment by Ben Smith (ben-smith) on Preprint is out! 100,000 lumens to treat seasonal affective disorder · 2022-07-04T23:02:39.960Z · LW · GW

You mentioned in the pre-print that results were "similar" for the two color temperatures, and referred to the Appendix for more information, but it seems like the Appendix isn't included in your pre-print. Are you able to elaborate on how similar results in these two conditions were? In my own personal exploration of this area I have put a lot of emphasis on color temperature. Your study makes me adjust down the importance of color temperature, although it would be good to get more information.

Comment by Ben Smith (ben-smith) on We Need a Consolidated List of Bad AI Alignment Solutions · 2022-07-04T15:54:07.746Z · LW · GW

A consolidated list of bad or incomplete solutions could have considerable didactic value: it could help people learn more about the various challenges involved.

Comment by Ben Smith (ben-smith) on Looking back on my alignment PhD · 2022-07-03T01:26:32.737Z · LW · GW

Not sure what I was thinking about, but probably just that my understanding is that "safe AGI via AUP" would have to penalize the agent for learning to achieve anything not directly related to the end goal, and that might make it too difficult to actually achieve the end goal when e.g. it turns out to need tangentially related behavior.

Your "social dynamics" section encouraged me to be bolder sharing my own ideas on this forum, and I wrote up some stuff today that I'll post soon, so thank you for that!

Comment by Ben Smith (ben-smith) on Looking back on my alignment PhD · 2022-07-02T21:06:12.966Z · LW · GW

That was an inspiring and enjoyable read!

Can you say why you think AUP is "pointless" for Alignment? It seems to me attaining cautious behavior out of a reward learner might turn out to be helpful. Overall my intuition is it could turn out to be an essential piece of the puzzle.

I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.

Comment by Ben Smith (ben-smith) on A descriptive, not prescriptive, overview of current AI Alignment Research · 2022-07-01T00:57:30.961Z · LW · GW

I would very much like to see your dataset, as a Zotero database or in some other format, in order to better orient myself to the space. Are you able to make this available somehow?

Comment by Ben Smith (ben-smith) on A descriptive, not prescriptive, overview of current AI Alignment Research · 2022-07-01T00:55:40.379Z · LW · GW

Very, very helpful! The clustering is obviously a function of the corpus. From your narrative, it seems like you only added the missing arXiv files after clustering. Is it possible the clusters would look different with those included?

Comment by Ben Smith (ben-smith) on Open Problems in Negative Side Effect Minimization · 2022-06-30T21:50:25.860Z · LW · GW

One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions about when it is permitted to continue. I have proposed a kind of "decision paralysis" where, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I'm not the first or only one to describe this kind of conservatism, but I don't recall the framing others have used). In this case, the multiple objectives might be the primary objective and your low-impact objective.

This might be a way forward for your "High-Impact Interference" problem. Perhaps preventing an agent from engaging in high-impact interference is a necessary part of safe AI. When fulfilling the primary objective seems to require high-impact interference, a safe AI might report to a human supervisor that it cannot proceed because of a particular side effect. The human supervisor could then decide whether the system should proceed. If the supervisor judges that it should, they can re-specify the objective to permit the potential side effect, by making it part of the primary objective itself.
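A minimal Python sketch of "decision paralysis" plus escalation to a supervisor, under stated assumptions: each objective scores an action relative to doing nothing (so the no-op scores 0 everywhere), the agent considers the action that best serves the primary objective, and any other objective scoring that action below zero vetoes it. The function name, action names, and scores are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# An objective scores an action relative to doing nothing (the no-op scores 0).
Objective = Callable[[str], float]


def select_or_escalate(
    actions: List[str], primary: Objective, others: Dict[str, Objective]
) -> Tuple[str, List[str]]:
    """Return (action, vetoing_objectives).

    The agent considers the action that best serves the primary objective.
    If any other objective (e.g. a low-impact objective) scores it below
    zero, the agent does nothing and reports which objectives objected,
    so a human supervisor can decide whether to re-specify the goal.
    """
    best = max(actions, key=primary)
    vetoes = [name for name, score in others.items() if score(best) < 0]
    if vetoes:
        return "no_op", vetoes
    return best, []


# Hypothetical scores for two ways of completing the same task.
primary_scores = {"take_long_route": 1.0, "break_through_wall": 1.5}
impact_scores = {"take_long_route": 0.0, "break_through_wall": -2.0}

action, vetoes = select_or_escalate(
    list(primary_scores),
    lambda a: primary_scores[a],
    {"low_impact": lambda a: impact_scores[a]},
)
print(action, vetoes)  # the agent halts and reports the low-impact veto

# The supervisor approves the side effect by folding it into the specification,
# modelled here (crudely) as removing the penalty for that one action.
approved_impact = {**impact_scores, "break_through_wall": 0.0}
approved = select_or_escalate(
    list(primary_scores),
    lambda a: primary_scores[a],
    {"low_impact": lambda a: approved_impact[a]},
)
print(approved)
```

A variant could instead fall back to the best non-vetoed action; this sketch escalates, matching the reporting idea above.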

Comment by Ben Smith (ben-smith) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-09T15:23:21.371Z · LW · GW

It seems like even amongst proponents of a "fast takeoff", we will probably have a few months of time between when we've built a superintelligence that appears to have unaligned values and when it is too late to stop it.

At that point, isn't stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?

That almost implies a simple plan for preparation: for every AGI built, researchers agree to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it would remain dormant until its operators indicate it should act. It would have an instrumental goal of protecting users' ability to come to it and request that the first one be shut down.

Comment by Ben Smith (ben-smith) on What's up with the recent monkeypox cases? · 2022-05-22T23:19:29.234Z · LW · GW

Thanks for your thorough response. It is well-argued and as a result, I take back what I said. I'm not entirely convinced by your response but I will say I now have no idea! Being low-information on this, though, perhaps my reaction to the "challenge trial" idea mirrors other low-information responses, which is going to be most of them, so I'll persist in explaining my thinking mainly in the hope it'll help you and other pro-challenge people argue your case to others.

I'll start with maybe my biggest worry about a challenge trial: the idea that you could have a disease with an in-the-wild CFR of ~1%, put 500 people through a challenge trial, and "very likely" none of them would die. With a CFR of 1%, the expected number of fatalities among 500 people is 5. If medical observation and all the other precautions applied during a challenge trial reduce the CFR by a factor of 10, to 0.1%, the expected number of deaths is only 0.5, but that still seems unacceptably high for one trial, to me? To get the joint probability of zero deaths across all 500 people above 95%, you need a CFR closer to 0.01%. Is it realistic to think all the precautions in a challenge trial can reduce CFR by a factor of 100, from 1% to 0.01%? I have no idea, perhaps you do, but I'd want to know before feeling personally comfortable with a challenge trial.
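The arithmetic above can be checked with a few lines of Python. This assumes each participant dies independently with the same probability (the CFR), so P(zero deaths) = (1 - CFR)^500; that independence assumption is a simplification.

```python
def expected_deaths(cfr: float, n: int) -> float:
    """Expected number of deaths among n participants."""
    return cfr * n


def p_zero_deaths(cfr: float, n: int) -> float:
    """P(no deaths), assuming each participant dies independently with probability cfr."""
    return (1.0 - cfr) ** n


def max_cfr_for(p_target: float, n: int) -> float:
    """Largest CFR such that (1 - cfr) ** n >= p_target."""
    return 1.0 - p_target ** (1.0 / n)


N = 500
for cfr in (0.01, 0.001, 0.0001):
    print(
        f"CFR {cfr:.2%}: expected deaths {expected_deaths(cfr, N):.2f}, "
        f"P(zero deaths) {p_zero_deaths(cfr, N):.3f}"
    )
print(f"Max CFR for a 95% chance of zero deaths: {max_cfr_for(0.95, N):.4%}")
```

Under that assumption, P(zero deaths) comes out at roughly 0.7% for a 1% CFR, roughly 61% for 0.1%, and roughly 95% for 0.01%, and the largest CFR compatible with a 95% chance of zero deaths is about 0.0103%.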

Regarding R values and monkeypox generally, my understanding on this topic doesn't go much beyond this post and the responses to it, so I'm pretty low-confidence on anything here. Thus, if you say R is potentially quite high, I believe you.

I do have additional uncertainty about R. From public reports about the means of transmission that say things like

Monkeypox virus is transmitted from one person to another by close contact with lesions, body fluids, respiratory droplets and contaminated materials such as bedding.

I'd have to guess it's going to be less infectious than covid, which had an R around 5? On the other hand, since OP asked the question, there's been more speculation about chains of transmission that seem to indicate a higher R. I acknowledge "lower than 5" leaves a wide margin of error!

Having said that, to my mind, I now feel very conflicted. Having read AllAmericanBreakfast's comment and their headline, I felt reassured that monkeypox wasn't much for the public to be worried about, and the CDC and WHO would figure it out. But on my own understanding, if R is high (as you say) and CFR is anywhere much above 0.1%, and there's a widespread outbreak, that is pretty scary and we should all be much more on the alert than we already are?

And that would affirm your conclusion that challenge trials would be a good idea, as long as we have confidence the risk to participants is low.

Comment by Ben Smith (ben-smith) on What's up with the recent monkeypox cases? · 2022-05-20T22:16:48.692Z · LW · GW

I agree that this is probably an overreaction.

I don't think challenge trials are warranted. There's real harm arising from doing challenge trials. They made sense for Covid because hundreds of millions of people caught it, thousands were dying every day, and getting an effective vaccine or treatment just one day sooner could save thousands of lives. So accepting a level of harm during testing is warranted. For a disease where R seems to be not much above 1, but CFR might be as high as 10%, I would say, even if we had a competent and well-funded pandemic prevention authority, they might pass on the challenge trials this time around.

Comment by Ben Smith (ben-smith) on Are our community grouphouses typically rented, or owned? · 2022-03-03T03:25:24.781Z · LW · GW

maybe everyone just rents. It would be the path of least resistance and you can make some arguments about the benefits of dynamism


The obvious benefit is low start-up capital: all you need is a security deposit (bond). And the dynamism you mentioned is also pretty relevant. I was going to say more, but on second thoughts, I'd just ask: why don't you find those reasons alone compelling?

Alternately, maybe one person owns the house, and everyone else pays rent to them

The thing is, owning a house is a financial commitment and a timesink. You have to manage all the repairs and such yourself. You have to keep paying the mortgage, and you can't just end the lease. You have to sell the house, and that will typically cost you a low five-figure sum, depending on the location.

I know one property investor who got started by buying large houses, starting a group house in each one, and then repeating this process several times. Each group house had a fairly distinct vibe (none were rationalist-themed). There was a lot of hassle in getting involved in various personal disputes between residents, but it worked out well for that investor in the long term.

I don't super recommend this as a way to invest in property, unless you BOTH (a) want to start a group house AND (b) have a group of specific individual residents in mind, or at least a target market with plenty of people who'd want to join your house in your particular location. For a "rationalist" themed house, you'd probably want to get started once you have specific people interested in moving in, unless you're in a large enough city that you can be confident of just finding rationalists when you advertise. If you want to try, but are open to relaxing your "rationalist" theme, it is probably a good long-term investment in most places.

Comment by Ben Smith (ben-smith) on Are our community grouphouses typically rented, or owned? · 2022-03-03T03:17:02.452Z · LW · GW

Owning a house doesn't give you fewer ongoing costs. It tends to give you lower costs overall, but that's heavily contingent on rental and mortgage rates. And it's actually more administrative hassle, because you have to spend money on rates (local property taxes), repairs, and so on. The main thing owning a house gives you is stability in terms of predicting future price changes.

Comment by Ben Smith (ben-smith) on A brief review of the reasons multi-objective RL could be important in AI Safety Research · 2021-10-12T00:37:54.393Z · LW · GW

The only other resource I'd recommend, beyond MODEM (when that's back up) and our upcoming JAAMAS special issue, is Elicit, Ought's GPT-3-based AI literature search engine (yes, they're teaching GPT-3 about how to create a superintelligent AI. Hmm). It's in beta, but if they waitlist you and don't let you in, email me and I'll suggest they add you. I wouldn't say it'll necessarily show you research you're not aware of, but I found it very useful for getting into the AI Alignment literature for the first time myself.

Comment by Ben Smith (ben-smith) on A brief review of the reasons multi-objective RL could be important in AI Safety Research · 2021-10-12T00:29:36.064Z · LW · GW

That's right. What I mainly have in mind is a vector of Q-learned values V and a scalarization function that combines them in some (probably non-linear) way. Note that in our technical work, the combination occurs during action selection, not during reward assignment and learning.

I guess whether one calls this "multi-objective RL" is semantic. Because objectives are combined during action selection, not during learning itself, I would not call it "single objective RL with a complicated objective". If you combined objectives during reward, then I could call it that.
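A minimal sketch of "a vector of Q-learned values plus a non-linear scalarization applied at action selection", in Python. Assumptions: tabular Q-values, two objectives, and a concave exponential scalarization chosen purely for illustration; it is not the scalarization function from our technical work. The key point is that `update` keeps one Q-estimate per objective and never sums rewards; objectives are only combined inside `select_action`.

```python
import math
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Tuple

N_OBJECTIVES = 2

# One Q-value per objective for every (state, action) pair.
Q: DefaultDict[Tuple[str, str], List[float]] = defaultdict(lambda: [0.0] * N_OBJECTIVES)


def scalarize(q: Sequence[float]) -> float:
    """Hypothetical non-linear (concave) scalarization.

    -exp(-v) is concave, so a drop on any single objective costs more
    than an equal-sized gain on another objective earns.
    """
    return sum(-math.exp(-v) for v in q)


def select_action(state: str, actions: Sequence[str]) -> str:
    # Objectives are combined here, at action selection time only.
    return max(actions, key=lambda a: scalarize(Q[(state, a)]))


def update(state, action, rewards, next_state, next_actions, alpha=0.1, gamma=0.9):
    """Per-objective Q-learning update; the reward vector is never summed."""
    a_next = select_action(next_state, next_actions)
    q = Q[(state, action)]
    for i in range(N_OBJECTIVES):
        target = rewards[i] + gamma * Q[(next_state, a_next)][i]
        q[i] += alpha * (target - q[i])


# Usage: equal linear sums (2.0 each), but the concave scalarization
# prefers the balanced Q-vector.
Q[("s", "balanced")] = [1.0, 1.0]
Q[("s", "lopsided")] = [3.0, -1.0]
print(select_action("s", ["balanced", "lopsided"]))

update("t", "balanced", rewards=[1.0, 0.5], next_state="end", next_actions=["balanced"])
print(Q[("t", "balanced")])
```

Bootstrapping from the action the scalarized greedy policy would take is one common design choice; non-linear scalarization interacts with Q-learning in subtle ways, so treat this as a sketch rather than a recommended algorithm.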

Re: your example of real-time control during hunger, I think yours is a pretty reasonable model. I haven't thought about homeostatic processes in this project (my upcoming paper is all about them!). I'm definitely not suggesting that our particular implementation of "MORL" (if we can call it that) is the only or even the best sort of MORL. I'm just trying to get started on understanding it! I really like the way you put it. It makes me think that perhaps the brain is a sort of multi-objective decision-making system with no single combinatory mechanism at all, except for the emergent winner of whatever kind of output happens in a particular context. That could plausibly be different depending on whether an action is moving limbs, talking, or mentally setting an intention for a long-term plan.

Comment by Ben Smith (ben-smith) on Why I am not currently working on the AAMLS agenda · 2021-09-18T02:45:30.262Z · LW · GW

Interesting comments, thanks. Currently exploring an agenda of my own and this is food for thought.

Comment by Ben Smith (ben-smith) on Signaling Virtuous Victimhood as Indicators of Dark Triad Personalities · 2021-08-30T01:48:58.764Z · LW · GW

I know it's a touchy topic. In my defense, the research is solid, published in social psychology's top journal. I suppose the study deals with rhetoric in a political context. This community has a long history of drawing on social and cognitive psychological research to understand fallacies of thought and rhetoric (HPMOR), and I posted in that tradition. Apologies if I have strayed a little too far into a politicized area.

One needn't see this study as a shot at any particular political side. I can imagine people engaging in 'virtuous-victimhood-signalling' within a wide range of different politicized narratives, as well as in completely apolitical contexts.

It also shouldn't be read as delegitimizing victims who speak out about their perspective! But perhaps it does provide evidence that sympathy can be weaponized in rhetorical conflict. We can all recognize this in political opponents while being blind to it amongst political allies.

Comment by Ben Smith (ben-smith) on Supplement to "Big picture of phasic dopamine" · 2021-07-07T08:27:59.808Z · LW · GW

Interesting. Is it fair to say that Mollick's system is relatively more "serial" with fewer parallelisms at the subcortical level, whereas you're proposing a system that's much more "parallel" because there are separate systems doing analogous things at each level? I think that parallel arrangement is probably the thing I've learned most personally from reading your work. Maybe I just hadn't thought about it because I focus too much on valuation and PFC decision-making stuff and don't look broadly enough at movement and other systems.

Apropos of nothing, is there any role for the visual cortex within your system?

I too am puzzled about why some people talk about "mPFC" and others talk about "vmPFC". I focus on "vmPFC", mostly because that's what people in my field talk about. "vmPFC" focuses much more on valuation systems. Theoretically, I guess "mPFC" would also include the dorsomedial prefrontal cortex, which includes the anterior cingulate cortex, and perhaps some systems related to executive control, response inhibition (although that's usually quite lateral), and abstract processing. It tends to be a bit of a decision-making homunculus of sorts :/ And then there's the ACC, whose role in various things is fairly well defined.

So maybe authors who talk about the mPFC aren't as concerned about distinguishing value processing from all those other things.

Comment by Ben Smith (ben-smith) on Conservatism in neocortex-like AGIs · 2021-05-04T17:16:42.588Z · LW · GW

As you're aware, I'm very much exploring this approach using multi-objective decision-making, with conservatism achieved by only acting when an action is non-negative on the whole set of objective functions that an actor considers.

The alternative Bayesian AGI approach is also worth thinking about. A conservative Bayesian AGI might not need multiple objectives. For each action, it just needs a single probability distribution of outcomes. If there are multiple theories of how to translate the consequences of its actions into its single utility function, each of those theories might be given some weight, and then they'd be combined into the probability distribution. Then a conservative Bayesian AGI only acts if an action's utility can't fall below zero. Or maybe there's always some remote possibility of going below zero, and programming this sort of behavior would be absolutely paralyzing. In that case maybe we just make it loss-averse rather than strictly avoiding any possibility of a negative outcome.
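A minimal Python sketch of the two conservative Bayesian variants described above, under illustrative assumptions: each theory gives a discrete outcome distribution per action as (probability, utility) pairs, theory credences are used as mixture weights, doing nothing ("no_op") has utility 0, and the loss-aversion coefficient of 2.5 is an arbitrary hypothetical value. All names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Dist = List[Tuple[float, float]]                     # [(probability, utility), ...]
Theories = Dict[str, Tuple[float, Dict[str, Dist]]]  # name -> (credence, action -> Dist)


def mixture(theories: Theories, action: str) -> Dist:
    """Single outcome distribution: each theory's distribution weighted by its credence."""
    return [(w * p, u) for w, dists in theories.values() for p, u in dists[action]]


def p_negative(dist: Dist) -> float:
    return sum(p for p, u in dist if u < 0)


def loss_weighted_value(dist: Dist, loss_aversion: float) -> float:
    """Expected utility with losses scaled by loss_aversion (1.0 gives plain EU)."""
    return sum(p * (u if u >= 0 else loss_aversion * u) for p, u in dist)


def choose_strict(theories: Theories, actions: Sequence[str], tolerance: float = 0.0) -> str:
    """Act only if P(utility < 0) <= tolerance; otherwise do nothing."""
    safe = [a for a in actions if p_negative(mixture(theories, a)) <= tolerance]
    return max(safe, key=lambda a: loss_weighted_value(mixture(theories, a), 1.0), default="no_op")


def choose_loss_averse(theories: Theories, actions: Sequence[str], loss_aversion: float = 2.5) -> str:
    """Act only if the loss-weighted value beats the no-op's utility of 0."""
    scores = {a: loss_weighted_value(mixture(theories, a), loss_aversion) for a in actions}
    best = max(scores, key=lambda a: scores[a])
    return best if scores[best] > 0 else "no_op"


# Two hypothetical theories of how outcomes translate into utility.
theories: Theories = {
    "theory_a": (0.7, {"act": [(0.99, 10.0), (0.01, -5.0)], "risky": [(0.5, 10.0), (0.5, -20.0)]}),
    "theory_b": (0.3, {"act": [(1.0, 4.0)], "risky": [(1.0, 0.0)]}),
}

print(choose_strict(theories, ["act", "risky"]))       # strict rule
print(choose_loss_averse(theories, ["act", "risky"]))  # loss-averse rule
```

In this toy example the strict rule refuses to act because "act" carries a 0.7% chance of a negative outcome, illustrating the paralysis worry, while the loss-averse rule still takes "act" and rejects "risky".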

Comment by Ben Smith (ben-smith) on Propinquity Cities So Far · 2020-11-25T04:13:22.893Z · LW · GW

Two examples come to mind:

Comment by Ben Smith (ben-smith) on Propinquity Cities So Far · 2020-11-25T04:08:54.277Z · LW · GW

in practice, similar proposals (that have actually been implemented, both in communist and nominally capitalist countries) have vastly underestimated the difficulty of this problem, leading to large problems that have made life harder for many people


Singapore and Hong Kong are two generally-capitalist cities that have employed largely government housing development of very dense, tall housing.

It worked REALLY well in capitalist, uber-wealthy Singapore (GDP per capita substantially higher than the USA). ~78% of Singaporeans live in housing developed by the Singapore Government's Housing and Development Board. It works a bit less well in Hong Kong, but still remarkably well considering how many people are housed in the very small area available.

Comment by Ben Smith (ben-smith) on Propinquity Cities So Far · 2020-11-25T04:04:34.133Z · LW · GW

But just this point is a rabbit hole of questions in itself.

If we equipped every store with tracking devices that measured the amount of time spent by people visiting (!!!!), that might incentivize making products really hard to find in the store, or making really long lines, so people spend more time there.

If it's the raw number of times people visit the store, I am sure there are ways to game that too ("visit us 10 times this week for 10% off your next purchase!"). There could be laws against that, but...