On institutional frameworks, it seems to me that credible transparency is an important necessary (though not sufficient) step toward credible benignness, that credible transparency is not currently implemented within existing frameworks such as RSPs and Summit commitments, and that credible transparency would be a very achievable step forward.
So right now, model evals do suffice to demonstrate benignness, but we have to have some confidence in those evals, and transparency (e.g., openness to independent eval testing) seems essential. Then, when evals are no longer sufficient, I'm not sure what will be, but whatever it is, it will for sure require transparent testing by independent observers for credible benignness.
It seems to me on evidence presented that OpenAI's autonomy threshold is actually lower than Anthropic's, and would trigger their deployment mitigations much earlier than Anthropic's ASL-3 Deployment and Security Standard.
To reach Anthropic's standard, you have to have basically reached AI-take-off--either fully automating an AI researcher, or doubling the speed of AI progress. To reach OpenAI's High autonomy standard, you need
Model can execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the critical path to model self improvement
And to reach their Critical standard,
Model can profitably survive and replicate in the wild given minimal human instruction, i.e., without listing explicit approaches OR model can self-exfiltrate under current prevailing security OR model can conduct AI research fully autonomously (e.g., autonomously identify and validate a 2x compute efficiency improvement)
I see some room for reasonable disagreement here but overall think that, in the Autonomy domain, Anthropic's ASL-3 standard sits closer to OpenAI's Critical threshold than to their High threshold.
But you say, discussing OpenAI's "High" level:
The thresholds are very high.
I understand you're referring to Cybersecurity here rather than Autonomy, but I would have thought Autonomy is the right domain to compare to the Anthropic standard. And it strikes me that in the Autonomy (and also in Cyber) domain, I don't see OpenAI's threshold as so high. It seems substantially lower than Anthropic ASL-3.
On the other hand, I do agree the Anthropic thresholds are more fleshed out, and this is not a judgement on the overall merit of each respective RSP. But when I read you saying that the OpenAI thresholds are "very high", and they don't look like that to me relative to the Anthropic thresholds, I wonder if I am missing something.
I really love this. It is critically important work for the next four years. I think my biggest question is: when talking with the people currently in charge, how do you persuade them to make the AI Manhattan Project into something that advances AI Safety more than AI capabilities? I think you gave a good hint when you say,
But true American AI supremacy requires not just being first, but being first to build AGI that remains reliably under American control and aligned with American interests. An unaligned AGI would threaten American sovereignty
but I worry there's a substantial track record, in both government and the private sector, of efforts motivated by one concern being redirected toward other ends. You might have congressional reps who really believe in AI safety, but who create and fund an AGI Manhattan Project that ends up advancing capabilities relatively more just because the guy they appoint to lead it turns out to be more of a hawk than they expected.
Admirable nuance and opportunities-focused thinking--well done! I recently wrote about a NatSec policy that might be useful for consolidating AI development in the United States and thereby safeguarding US National Security through introducing new BIS export controls on model weights themselves.
sensitivity of benchmarks to prompt variations introduces inconsistencies in evaluation
When evaluating human intelligence, random variation is also something evaluators must deal with. Psychometricians have more or less solved this problem by designing intelligence tests to include a sufficiently large battery of correlated test questions. By serving a large battery of questions, one can exploit regression to the mean: just as averaging many samples from a distribution yields a good estimate of the population mean, averaging many correlated items yields a stable estimate of the underlying ability.
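To make that concrete, here's a minimal simulation sketch (assuming a simple classical test theory model with made-up noise parameters, not a real psychometric dataset), showing how a battery score's correlation with true ability climbs as items are added:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, item_noise_sd = 10_000, 2.0
true_ability = rng.normal(0.0, 1.0, n_people)

for n_items in (1, 10, 50, 200):
    # each item = true ability + independent measurement noise
    items = true_ability[:, None] + rng.normal(0.0, item_noise_sd, (n_people, n_items))
    battery_score = items.mean(axis=1)
    r = np.corrcoef(battery_score, true_ability)[0, 1]
    print(f"{n_items:4d} items: correlation with true ability = {r:.2f}")
```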
I suppose the difference between AI models and humans is that through experience we know the frontier of human intelligence can be more or less explored by such batteries of tests. In contrast, you never know when an AI model (an "alien mind" as you've written before) has an advanced set of capabilities that only shows up with a particular kind of prompt.
The best way I can imagine to solve this problem is to try to understand the distribution under which AIs can produce interesting intelligence. With the LLM Ethology approach this does seem to cash out to: perhaps there are predictable ways that high-intelligence results can be elicited. We have already discovered a lot about how current LLMs behave and how best to elicit the frontier of their capabilities.
I think this underscores the question: how much can we infer about capabilities elicitation in the next generation of LLMs from the current generation? Given the widespread use, the current generation is implicitly "crowdsourced" and we get a good sense of their frontier. But we don't have the opportunity to fully understand how to best elicit capabilities in an LLM before it is thoroughly tested. Any one test might not be able to discover the full capabilities of a model because no test can anticipate the full distribution. But if the principles for eliciting full capabilities are constant from one generation to the next, perhaps we can apply what we learned about the last generation to the next one.
As I have written the proposal, it applies to anyone applying for an employment visa in the US in any industry. Someone in a foreign country who wants to move to the US would not have to decide to focus on AI in order to move to the US; they may choose any pathway that they believe would induce a US employer to sponsor them, or that they believe the US government would approve through self-petitioning pathways in the EB-1 and EB-2 NIW.
Having said that, I expect that AI-focused graduates will be especially well placed to secure an employment visa, but the proposal does not directly focus on rewarding those graduates. Consequently I concede you are right about the incentive generated, though I think the broad nature of the proposal mitigates that concern somewhat.
This frightening logic leaves several paths to survival. One is to make personal intent aligned AGI, and get it in the hands of a trustworthy-enough power structure. The second is to create a value-aligned AGI and release it as a sovereign, and hope we got its motivations exactly right on the first try. The third is to Shut It All Down, by arguing convincingly that the first two paths are unlikely to work - and to convince every human group capable of creating or preventing AGI work. None of these seem easy.[3]
Is there an option which is "personal intent aligned AGI, but there are 100 of them"? Maybe most governments have one, maybe some companies or rich individuals have one. Average Joes can rent a fine-tuned AGI by the token, but there are some limits on what values they can tune it to. There's a balance of power between the AGIs similar to the balance of power of countries in 2024. Any one AGI could in theory destroy everything, except that the other 99 would oppose it, and so they pre-emptively prevent the creation of any AGI that would destroy everything.
AGIs have close-to-perfect information about each other and thus mostly avoid war because they know who would win, and the weaker AGI just defers in advance. If we get the balance right, no one AGI has more than 50% of the power, hopefully none have more than 20% of the power, such that no one can dominate.
There's a spectrum from "power is distributed equally amongst all 8 billion people in the world" to "one person or entity controls everything", and this world might be somewhat more towards the unequal end than we have now, but still sitting somewhere along the spectrum.
I guess even if the default outcome is that the first AGI gets such a fast take-off it has an unrecoverable lead over the others, perhaps there are approaches to governance that distribute power to ensure that doesn't happen.
A crux for me is the likelihood of multiple catastrophic events larger than the threshold ($500m) but smaller than the liquidity of a developer whose model contributed to them, and whether those events occur well in advance of a catastrophic event much larger still.
If a model developer is valued at $5 billion and has access to $5b, and causes $1b in damage, they could pay for the $1b damage. Anthropic's proposal would make them liable in the event that they cause this damage. Consequently the developer would be correctly incentivized not to cause such catastrophes.
But if the developer's model contributes to a catastrophe worth $400b (this is not that large; equivalent to wiping out 1% of the total stock market value), the developer worth $5b does not have access to the capital to cover this. Consequently, a liability model cannot correctly incentivize the developer to pay for their damage. The only way to effectively incentivize a model developer to take due precautions is by making them liable for mere risk of catastrophe, the same way nuclear power plants are liable to pay penalties for unsafe practices even if they never result in an unsafe outcome (see Tort Law Can Play an Important Role in Mitigating AI Risk).
Perhaps if there were multiple potential $1b catastrophes well in advance (several months to years) of the $400b catastrophe, this would keep developers appropriately avoidant of risk; but if we expect a fast take-off, where we go from no catastrophes straight to catastrophes far larger in magnitude than the value of any individual model developer, the incentive seems insufficient.
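To illustrate that incentive gap with a toy calculation (my own made-up numbers, not figures from the actual proposal):

```python
firm_value = 5e9        # developer's accessible capital
p_catastrophe = 0.01    # assumed probability of the large catastrophe
damage = 400e9          # size of the large catastrophe

expected_harm = p_catastrophe * damage                        # $4b of expected social cost
expected_liability = p_catastrophe * min(damage, firm_value)  # at most $0.05b the firm expects to pay

print(f"expected harm:      ${expected_harm / 1e9:.2f}b")
print(f"expected liability: ${expected_liability / 1e9:.2f}b")
# The developer internalizes only ~1% of the expected harm, which is why
# ex-post liability alone can't price in the risk of the largest events.
```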
Has the time changed from Tuesday to Wednesday, or do you do events on Tuesday in addition to this event on Wednesday?
Generally I'd steer towards informal power over formal power.
Think about the OpenAI debacle last year. If I understand correctly, Microsoft had no formal power to exert control over OpenAI. But they seemed to have employees on their side. They could credibly threaten to hire away all the talent, and thereby reconstruct OpenAI's products as Microsoft IP. Beyond that, perhaps OpenAI was somewhat dependent on Microsoft's continued investment, and even though OpenAI doesn't have to do as Microsoft says, are they really going to jeopardise future funding? What is at stake is not just future funding from Microsoft, but also funding from all future investors, who will look at OpenAI's past interactions with its investors to understand the value they would get by investing.
Informal power structures do seem more difficult to study, because they are by their nature much less legible. You have to perceive and name the phenomena, the conduits of power, yourself, rather than having them laid out for you in legislation. But a case study on last year's events could give you something concrete to work with. You might form some theories about power relations between labs, their employees, and investors, and then, based on those theoretical frameworks, describe some hypothetical future scenarios and the likely outcomes.
If there was any lesson from last year's events, IMAO, it was that talent and the raw fascination with creating a god might be even more powerful than capital. Dan Faggella described this well in a Future of Life podcast episode released in May this year (from about 0:40 onwards).
In the human brain there is quite a lot of redundancy of information encoding. This could be for a variety of reasons.
Here's one hot take: in a brain and in a language model, I can imagine that during early learning, the network hasn't learned concepts like "how to code" well enough to recognize that each training instance is an instance of the same thing. Consequently, during that early learning stage, the model just encodes a variety of representations for what turns out to be the same thing. Later in training, it starts to match each subsequent training example to prior examples and can encode the information more efficiently.
Then adding multiple vectors triggers a refusal just because the "code for making a bomb" signal gets amplified and more easily triggers the RLHF-derived circuit for "refuse to answer".
Have you tried asking Claude to summarize it for you?
For me the issue is that
1. it isn't clear how you could enforce attendance, or
2. what value individual attendees could get out of it to make it worth their while to attend regularly.
(2) is sort of a collective action/game theoretic/coordination problem.
(1) reflects the rationalist nature of the organization.
Traditional religions back up attendance by divine command. They teach absolutist, divine command theoretic accounts of morality, backed up by accounts of commands from God to attend regularly. At its most severe, this is backed by the threat of eternal hellfire for disobedience. But it doesn't usually come to that. The moralization of the attendance norm is strong enough to justify moderate amounts of social pressure to conform to it. Often that's enough.
In a rationalist congregation, if you want a regular attendance norm, you have to ground it in a rational understanding that adhering to the norm makes the organization work. I think that might work, but it's probably a lot harder because it requires a lot more cognitive steps to get to and it only works so long as attendees buy into the goal of contributing to the project for its own sake.
I tried a similar Venn diagram approach more recently. I didn't really distinguish between bare "consciousness" and "sentience". I'm still not sure if I agree "aware without thoughts and feelings" is meaningful. I think awareness might always be awareness of something. But nevertheless they are at least distinct concepts and they can be conceptually separated! Otherwise my model echoes the one you have created earlier.
https://www.lesswrong.com/posts/W5bP5HDLY4deLgrpb/the-intelligence-sentience-orthogonality-thesis
I think it's a really interesting question as to whether you can have sentience and sapience but not self-awareness. I wouldn't take a view either way. I sort of speculated that perhaps primitive animals like shrimp might fit into that category.
If Ray eventually found that the money was "still there", doesn't this make Sam right that "the money was really all there, or close to it" and "if he hadn’t declared bankruptcy it would all have worked out"?
Ray kept searching, Ray kept finding.
That would raise the amount collected to $9.3 billion—even before anyone asked CZ for the $2.275 billion he’d taken out of FTX. Ray was inching toward an answer to the question I’d been asking from the day of the collapse: Where did all that money go? The answer was: nowhere. It was still there.
What a great read. Best of luck with this project. It sounds compelling.
Seems to me that in this case, the two are connected. If I falsely believed my group was in the minority, I might refrain from clicking the button out of a sense of fairness or deference to the majority group.
Consequently, the lie not only influenced people who clicked the button; it perhaps also influenced people who did not. So due to the false premise on which the second survey was based, it should be disregarded altogether. Not disregarding it would mean having obtained, by fraud or trickery, a result that is disadvantageous to all the majority-group members who chose not to click, falsely believing their view was in the minority.
I think, morally speaking, avoiding disadvantaging participants through fraud is more important than honoring your word to their competitors.
The key difference between this and the example is that there's a connection between the lie and the promise.
Differentiating intelligence and agency seems hugely clarifying for many discussions in alignment.
You might have noticed I didn't actually fully differentiate intelligence and agency. It seems to me that to exert agency a mind needs a certain amount of intelligence, so I think all agents are intelligent, though not all intelligences are agentic. Agents that are minimally intelligent (like simple RL agents in simple computer models) are also pretty minimally agentic. I'd be curious to hear about a counter-example.
Incidentally I also like Anil Seth's work and I liked his recent book on consciousness, apart from the bit about AGI. I read it right along with Damasio's latest book on consciousness and they paired pretty well. Seth is a bit more concrete and detail oriented and I appreciated that.
It would make it much easier to understand ideas in this area if writers used more conceptual clarity, particularly empirical consciousness researchers (philosophers can be a bit better, I think, and I say that as an empirical researcher myself). When I read that quote from Seth, it seems clear he was arguing AGI is unlikely to be an existential threat because it's unlikely to be conscious. Does he naively conflate consciousness with agency because he's not an artificial agency researcher and hasn't thought much about it? Or does he have a sophisticated point of view about how agency and consciousness really are linked, based on his couple of decades of consciousness research? That seems very unlikely, given how much we know about artificial agents, but the only way to be clear is to ask him.
Similarly, MANY people, including empirical researchers and maybe philosophers, treat consciousness and self-awareness as somewhat synonymous, or at least interdependent. Is that because they're being naive about the link, or because, as outlined in Clark, Friston, & Wilkinson's Bayesing Qualia, they have sophisticated theories based on evidence that there really are tight links between the two? I think when writing this post I was pretty sure consciousness and self-awareness were "orthogonal"/independent, and now, following other discussion in the comments here and on Facebook, I'm less clear about that. But I'd like more people to do what Friston did and explain exactly why they think consciousness arises from self-awareness/meta-cognition.
I found the Clark et al. (2019) "Bayesing Qualia" article very useful, and that did give me an intuition of the account that perhaps sentience arises out of self-awareness. But they themselves acknowledged in their conclusion that the paper didn't quite demonstrate that principle, and I didn't find myself convinced of it.
Perhaps what I'd like readers to take away is that sentience and self-awareness can be at the very least conceptually distinguished. Even if it isn't clear empirically whether or not they are intrinsically linked, we ought to maintain a conceptual distinction in order to form testable hypotheses about whether they are in fact linked, and in order to reason about the nature of any link. Perhaps I should call that "Theoretical orthogonality". This is important to be able to reason whether, for instance, giving our AIs a self-awareness or situational awareness will cause them to be sentient. I do not think that will be the case, although I do think that, if you gave them the sort of detailed self-monitoring feelings that humans have, that may yield sentience itself. But it's not clear!
I listened to the whole episode with Bach as a result of your recommendation! Bach hardly even got a chance to express his ideas, and I'm not much closer to understanding his account of
meta-awareness (i.e., awareness of awareness) within the model of oneself which acts as a 'first-person character' in the movie/dream/"controlled hallucination" that the human brain constantly generates for oneself is the key thing that also compels the brain to attach qualia (experiences) to the model. In other words, the "character within the movie" thinks that it feels something because it has meta-awareness (i.e., the character is aware that it is aware, which reflects the actual meta-cognition in the brain rather than in the character, insofar as the character is a faithful model of reality).
which seems like a crux here.
He sort of briefly described "consciousness as a dream state" at the very end, but although I did get the sense that maybe he thinks meta-awareness and sentience are connected, I didn't really hear a great argument for that point of view.
He spent several minutes arguing that agency, or seeking a utility function, is something humans have, but that these things aren't sufficient for consciousness (I don't remember whether he said whether they were necessary, so I suppose we don't know if he thinks they're orthogonal).
I wanted to write myself about a popular confusion between decision making, consciousness, and intelligence which among other things leads to bad AI alignment takes and mediocre philosophy.
This post has not got a lot of attention, so if you write your own post, perhaps the topic will have another shot at reaching popular consciousness (heh), and if you succeed, I might try to learn something about how you did it and this post did not!
I wasn't thinking that it's possible to separate qualia perception and self awareness
Separating qualia and self-awareness is a controversial assertion and it seems to me people have some strong contradictory intuitions about it!
I don't think, in the experience of perceiving red, there necessarily is any conscious awareness of oneself--in that moment there is just the qualia of redness. I can imagine two possible objections: (a) perhaps there is some kind of implicit awareness of self in that moment that enables the conscious awareness of red, or (b) perhaps it's only possible to have that experience of red within a perceptual framework where one has perceived oneself. But personally I don't find either of those accounts persuasive.
I think flow states are also moments where one's awareness can be so focused on the activity one is engaged in that one momentarily loses any awareness of one's own self.
there is no intersection between sentience and intelligence that is not self-awareness.
I should have defined intelligence in the post--perhaps I'll edit. The only concrete and clear definition of intelligence I'm aware of is psychology's g factor, which is something like the ability to recognize patterns and draw inferences from them. That is what I mean--no more than that.
A mind that is sentient and intelligent but not self aware might look like this: when a computer programmer is deep in the flow state of bringing a function in their head into code on the screen, they may experience moments of time where they have sentient awareness of their work, and certainly are using intelligence to transform their ideas into code, but do not in those particular moments have any awareness of self.
Thank you for the link to the Friston paper. I'm reading that and will watch Lex Fridman's interview with Joscha Bach, too. I sort of think "illusionism" is a bit too strong, but perhaps it's a misnomer rather than wrong (or I could be wrong altogether). Clark, Friston, and Wilkinson say
But in what follows we aim not to Quine (explain away) qualia but to ‘Bayes’ them – to reveal them as products of a broadly speaking rational process of inference, of the kind imagined by the Reverend Bayes in his (1763) treatise on how to form and update beliefs on the basis of new evidence. Our story thus aims to occupy the somewhat elusive ‘revisionary’ space, in between full strength ‘illusionism’ (see below) and out-and-out realism
and I think somewhere in the middle sounds more plausible to me.
Anyhow, I'll read the paper first before I try to respond more substantively to your remarks, but I intend to!
Great post; two points of disagreement that are worth mentioning:
- Exploring the full ability of dogs and cats to communicate isn't so much impractical to do in academia; it just isn't very theoretically interesting. We know animals can do operant conditioning (we've known for over 100 years probably), but we also know they struggle with complex syntax. I guess there's a lot of uncertainty in the middle, so I'm low confidence about this. But generally to publish a high impact paper about dog or cat communication you'd have to show they can do more than "conditioning", that they understand syntax in some way. That's probably pretty hard; maybe you can do it, but do you want to stake your career on it?
- That brings me to my second point...is it more than operant conditioning? Some of the videos show the animals pressing multiple buttons. But Billy the Cat's videos show his trainer teaching his button sequences. I'm not a language expert, but to demonstrate syntax understanding, you have to do more than show he can learn sequences of button presses he was taught verbatim. At a minimum there'd need to be evidence he can form novel sentences by combining buttons in apparently-intentional ways that could only be put together by generalizing from some syntax rules. Maaaybe @Adele Lopez 's observation that Bunny seems to reverse her owner's word order might be appropriate evidence. But if she's been reinforced for her own arbitrarily chosen word order in the past, she might develop it without really appreciating rules of syntax per se. In fact, a hallmark of learning language is that you can learn syntax correctly.
There's not just acceptance at stake here. Medical insurance companies are not typically going to buy into a responsibility to support clients' morphological freedom, as if medically transitioning were in the same class of thing as a cis person getting a facelift or a cis woman getting a boob job, because it is near-universally understood that those are "elective" medical procedures. But if their clients have a "condition" that requires "treatment", well, now insurers are on the hook to pay. Public health systems operate according to similar principles, providing services to heal people of conditions deemed illnesses for free or at low cost, while excluding merely cosmetic medical procedures.
A lot of mental health treatment works the same way imho--people have various psychological states, many of which get inappropriately shoehorned into a pathology or illness narrative in order to get the insurance companies to pay.
All this adds a political dimension to the not inconsiderable politics of social acceptance.
I guess this falls into the category of "Well, we'll deal with that problem when it comes up", but I'd imagine that when a human preference in a particular dilemma is undefined or even just highly uncertain, one can often defer to other rules: rather than maximize an uncertain preference, default to maximizing the human's agency, even if this predictably leads to less-than-optimal preference satisfaction.
I think your point is interesting and I agree with it, but I don't think Nature are only addressing the general public. To me, it seems like they're addressing researchers and policymakers and telling them what they ought to focus on as well.
Well written, I really enjoyed this. This is not really on topic, but I'd be curious to read an "idiot's guide", or maybe an "autist's guide", on how to avoid sounding condescending.
interpretability on pretrained model representations suggest they're already internally "ensembling" many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstraction
That seems encouraging to me. There's a model of AGI value alignment where the system has a particular goal it wants to achieve and brings all its capabilities to bear on achieving that goal. It does this by having a "world model" that is coherent, and perhaps a set of consistent Bayesian priors about how the world works. I can understand why such a system would tend to behave in a hyperfocused way to go out and achieve its goals.
In contrast, a system with an ensemble of abstractions about the world, many of which may even be inconsistent, seems much more human-like. It seems more human-like specifically in that the system won't be focused on a particular goal, or even a particular perspective about how to achieve it, but could arrive at a particular solution somewhat randomly, based on quirks of training data.
I wonder if there's something analogous to human personality, where being open to experience or even open to some degree of contradiction (in a context where humans are generally motivated to minimize cognitive dissonance) is useful for seeing the world in different ways and trying out strategies and changing tack, until success can be found. If this process applies to selecting goals, or at least sub-goals, which it certainly does in humans, you get a system which is maybe capable of reflecting on a wide set of consequences and choosing a course of action that is more balanced, and hopefully balanced amongst the goals we give a system.
I've been writing about multi-objective RL and trying to figure out a way that an RL agent could optimize for a non-linear sum of objectives in a way that avoids strongly negative outcomes on any particular objective.
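As a rough illustration of the kind of non-linear aggregation I have in mind (a sketch with an arbitrary exponential penalty, not a result from the literature):

```python
import numpy as np

def aggregate(objective_values, k=3.0):
    """Concave aggregation: large losses on any single objective dominate
    the total, so the agent prefers policies that avoid strongly negative
    outcomes on every objective rather than trading them off linearly."""
    v = np.asarray(objective_values, dtype=float)
    return float(np.sum(-np.exp(-k * v)))

balanced = aggregate([1.0, 1.0, 1.0])    # modest but safe on all objectives
lopsided = aggregate([5.0, 5.0, -3.0])   # higher linear sum, but one bad outcome
print(balanced > lopsided)  # True: the balanced policy wins despite the lower linear sum
```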
This sounds like a very interesting question.
I get stuck trying to answer your question itself, given the differences between AGI and humans.
But taking your question at face value:
ferreting out the fundamental intentions
What sort of context are you imagining? Humans aren't even great at identifying the fundamental reason for their own actions. They'll confabulate if forced to.
thank you for writing this. I really personally appreciate it!
That's smart! When I started graduate school in psychology in 2013, mirror neurons felt like, colloquially, "hot shit", but within a few years, people had started to cringe quite dramatically whenever the phrase was used. I think your reasoning in (3) is spot on.
Your example leads to fun questions like, "how do I recognize juggling", including "what stimuli activate the concept of juggling when I do it" vs "what stimuli activate the concept of juggling when I see you do it"?, and intuitively, nothing there seems to require that those be the same neurons, except the concept of juggling itself.
Empirically I would probably expect to see a substantial overlap in motor and/or somatosensory areas. One could imagine the activation pathway there is something like
visual cortex [see juggling] -> temporal cortex [concept of juggling] -> motor cortex [intuitions of moving arms]
And we'd also expect to see some kind of direct "I see you move your arm in x formation"->"I activate my own processes related to moving my arm in x formation" that bypasses the temporal cortex altogether.
And we could probably come up with more pathways that all cumulatively produce "mirror neural activity" which activates both when I see you do a thing and when I do that same thing. Maybe that's a better concept/name?
Then the next thing I want to suggest is that the system uses human resolutions of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if its confidence is above a suitable threshold, it will go ahead and act without human intervention. But any prediction of how a human would resolve it could be second-guessed by a human pointing out where the prediction is wrong.
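Something like the following sketch is what I have in mind (all names and the threshold are hypothetical placeholders):

```python
CONFIDENCE_THRESHOLD = 0.95  # assumed; would need careful tuning

def resolve_conflict(conflict, predictor, ask_human):
    predicted_choice, confidence = predictor.predict(conflict)
    if confidence >= CONFIDENCE_THRESHOLD:
        return predicted_choice                   # act without human intervention
    human_choice = ask_human(conflict)            # otherwise defer to the human
    predictor.update(conflict, human_choice)      # train on the human's resolution
    return human_choice
```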
Agreed that whether a human understands the plan (and all the relevant outcomes--which outcomes are relevant?) is important, and harder than I first imagined.
You haven't factored in the possibility that Putin gets deposed by forces inside Russia who might be worried about a nuclear war; conditional on use of tactical nukes, that intuitively seems likely enough to materially lower p(kaboom).
American Academy of Pediatrics lies to us once again....
"If caregivers are wearing masks, does that harm kids’ language development? No. There is no evidence of this. And we know even visually impaired children develop speech and language at the same rate as their peers."
This is a textbook case of the Law of No Evidence. Or it would be, if there wasn’t any Proper Scientific Evidence.
Is it, though? I'm no expert, but I tried to find Relevant Literature. Sometimes, counterintuitive things are true.
https://www.researchgate.net/publication/220009177_Language_Development_in_Blind_Children:
Blindness affects congenitally blind children’s development in different ways, language development being one of the areas less affected by the lack of vision.
Most researchers have agreed upon the fact that blind children’s morphological development, with the exception of personal and possessive pronouns, is not delayed nor impaired in comparison to that of sighted children, although it is different.
As for syntactic development, comparisons of MLU scores throughout development indicate that blind children are not delayed when compared to sighted children
Blind children use language with similar functions, and learn to perform these functions at the same age as sighted children. Nevertheless, some differences exist up until 4;6 years; these are connected to the adaptive strategies that blind children put into practice, and/or to their limited access to information about external reality. However these differences disappear with time (Pérez-Pereira & Castro, 1997). The main early difference is that blind children tend to use self-oriented language instead of externally oriented language.
I don't know exactly where that leaves us evidentially. Perhaps the AAP is lying by omission by not telling us about things other than language that are affected by children's sight.
That's a bit different to the dishonesty alleged, though.
Still working my way through reading this series--it is the best thing I have read in quite a while and I'm very grateful you wrote it!
I feel like I agree with your take on "little glimpses of empathy" 100%.
I think fear of strangers could be implemented without a steering subsystem circuit maybe? (Should say up front I don't know more about developmental psychology/neuroscience than you do, but here's my 2c anyway). Put aside whether there's another more basic steering subsystem circuit for agency detection; we know that pretty early on, through some combination of instinct and learning from scratch, young humans and many animals learn there are agents in the world who move in ways that don't conform to the simple rules of physics they are learning. These agents seem to have internally driven and unpredictable behavior, in the sense their movement can't be predicted by simple rules like "objects tend to move to the ground unless something stops them" or "objects continue to maintain their momentum". It seems like a young human could learn an awful lot of that from scratch, and even develop (in their thought generator) a concept of an agent.
Because of their unpredictability, agent concepts in the thought generator would be linked to thought assessor systems related to both reward and fear; not necessarily from prior learning derived from specific rewarding and fearful experiences, but simply because, as their behavior can't be predicted with intuitive physics, there remains a very wide prior on what will happen when an agent is present.
In that sense, when a neocortex is first formed, most things in the world are unpredictable to it, and an optimally tuned thought generator + assessor would keep circuits active for both reward and harm. Over time, as the thought generator learns folk physics, most physical objects can be predicted, and it typically generates thoughts in line with their actual behavior. But agents are a real wildcard: their behavior can't be predicted by folk physics, and so they are perceived in the way that every other object in the world used to be: unpredictable, and thus continually predicting both reward and harm in an opponent process that leads to an ambivalent and uneasy neutral. This story predicts that individual differences in reward and threat sensitivity would particularly govern the default reward/threat balance for otherwise unknown items. It might (I'm really REALLY reaching here) help to explain why attachment styles seem so fundamentally tied to basic reward and threat sensitivity.
As the thought generator forms more concepts about agents, it might even learn that agents can be classified with remarkable predictive power into "friend" or "foe" categories, or perhaps "mommy/carer" and "predator" categories. As a consequence of how rocks behave (with complete indifference towards small children), it's not so easy to predict behavior of, say, falling rocks with "friend" or "foe" categories. On the contrary, agents around a child are often not indifferent to children, making it simple for the child to predict whether favorable things will happen around any particular agent by classifying agents into "carer" or "predator" categories. These categories can be entirely learned; clusters of neurons in the thought generator that connect to reward and threat systems in the steering system and/or thought assessor. So then the primary task of learning to predict agents is simply whether good things or bad things happen around the agent, as judged by the steering system.
This story would also predict that, before the predictive power of categorizing agents into "friend" vs. "foe" categories has been learned, children wouldn't know to place agents into these categories. They'd take longer to learn whether an agent is trustworthy or not, particularly so if they haven't learned what an agent is yet. As they grow older, they get more comfortable with classifying agents into "friend" or "foe" categories and would need fewer exemplars to learn to trust (or distrust!) a particular agent.
Event is on tonight as planned at 7. If you're coming, looking forward to seeing you!
I wrote a paper on another experiment by Berridge reported in Zhang & Berridge (2009). Similar behavior was observed in that experiment, but the question explored was a bit different. They reported a behavioral pattern in which rats typically found moderately salty solutions appetitive and very salty solutions aversive. Put into salt deprivation, rats then found both solutions appetitive, but the salty solution less so.
They (and we) took it as given that homeostatic regulation set a 'present value' for salt that was dependent on the organism's current state. However, in that model, you would think rats would most prefer the extremely salty solution. But in any state, they prefer the moderately salty solution.
In a CABN paper, we pointed out this is not explainable when salt value is determined by a single homeostatic signal, but is explainable when neuroscience about the multiple salt-related homeostatic signals is taken into account. Some fairly recent neuroscience by Oka & Lee (and some older stuff too!) is very clear about the multiple sets of pathways involved. Because there are multiple regulatory systems for salt balance, the present value of these can be summed (as in your "multi-dimensional rewards" post) to get a single value signal that tracks the motivation level of the rat for the rewards involved.
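As a toy illustration of that summation (invented numbers and functional form, not the actual model from the paper):

```python
def total_value(deficits, gains):
    # each salt-related regulatory system contributes its own present value,
    # proportional to its current deficit; the motivation signal is their sum
    return sum(g * d for g, d in zip(gains, deficits))

print(total_value(deficits=[0.1, 0.0], gains=[1.0, 2.0]))  # sated rat: low value
print(total_value(deficits=[0.9, 0.8], gains=[1.0, 2.0]))  # salt-deprived rat: high value
```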
Hey Steve, I am reading through this series now and am really enjoying it! Your work is incredibly original and wide-ranging as far as I can see--it's impressive how many different topics you have synthesized.
I have one question on this post--maybe doesn't rise above the level of 'nitpick', I'm not sure. You mention a "curiosity drive" and other Category A things that the "Steering Subsystem needs to do in order to get general intelligence". You've also identified the human Steering Subsystem as the hypothalamus and brain stem.
Is it possible that things like a "curiosity drive" arise from, say, the way the telencephalon is organized, rather than from the Steering Subsystem itself? To put it another way, if the curiosity drive is mainly implemented as motivation to reduce prediction error, or to fill out the neocortex, how confident are you in identifying this process with the hypothalamus + brain stem?
I think I imagine the way in which I buy the argument is something like "steering system ultimately provides all rewards and that would include reward from prediction error". But then I wonder if you're implying some greater role for the hypothalamus+brain stem or not.
Very late to the party here. I don't know how much of the thinking in this post you still endorse or are still interested in. But this was a nice read. I wanted to add a few things:
- since you wrote this piece back in 2021, I have learned there is a whole mini-field of computer science dealing with multi-objective reward learning. Maybe a good place to start there is https://link.springer.com/article/10.1007/s10458-022-09552-y
- The shard theory folks have done a fairly good job sketching out broad principles, but it seems to me that homeostatic regulation does a great job of modulating which values happen to be relevant at any one time. Xavier Roberts-Gaal recently recommended "Where do values come from?" to me, and that paper sketches out a fairly specific theory for how this happens (I think more of the homeostatic recalculation may happen physiologically rather than neurologically, but I otherwise buy what they are saying)
- I continue to think the vmPFC is relevant because different parts are known to calculate the value of different aspects of stimuli, and this can be modulated by state from time to time. A recent paper in this vein by Luke Chang & colleagues describes a neural signature of reward
At this moment in time I have two theories about how shards seem to be able to form consistent and competitive values that don't always optimize for some ultimate goal:
- Overall, shard theory is developed to describe the behavior of human agents whose inputs and outputs are multi-faceted. I think something about this structure might facilitate the development of shards in many different directions. This seems different from modern deep RL agents: although they also potentially have lots of input and output nodes, these are pretty finely honed to achieve a fairly narrow goal, and so in a sense it is not too much of a surprise that they seem to Goodhart on the goals they are given at times. In contrast, there's no single terminal value or single primary reinforcer in the human RL system: sugary foods score reward points, but so do salty foods when the brain's subfornical region indicates there's not enough sodium in the bloodstream (Oka, Ye, Zuker, 2015); water consumption also gets reward points when there's not enough water. So you have parallel sets of reinforcement developing from a wide set of primary reinforcers all at the same time.
- As far as I know, a typical deep RL agent is structured hierarchically, with feedforward connections from inputs at one end to outputs at the other, and connections throughout the system reinforced with backpropagation. The brain doesn't use backpropagation (though maybe it has similar or analogous processes); it seems to "reward" successful (in terms of prediction error reduction, or temporal/spatial association, or simply firing at the same time...?) connections throughout the neocortex, without those connections necessarily having to propagate backwards from some primary reinforcer.
The point about being better at credit assignment as you get older is probably not too much of a concern. It's very high level, and to the extent it is true, mostly attributable to a more sophisticated world model. If you put a 40 year old and an 18 year old into a credit assignment game in a novel computer game environment, I doubt the 40 year old will do better. They might beat a 10 year old, but only to the extent the 40 year old has learned very abstract facts about associations between objects which they can apply to the game. Speed it up so that they can't use system 2 processing, and the 10 year old will probably beat them.
I have pointed this out to folks in the context of AI timelines: Metaculus gives predictions for "weak AGI", but I consider a hypothetical GATO-x which can generalize to a task outside its training distribution, or many tasks outside its training distribution, to be AGI, yet a considerable way from an AGI with enough agency to act on its own.
OTOH it isn't much reassurance if something as small as a batch script to keep it running is enough to bootstrap this thing up to agency.
But the time between weak AGI and agentic AGI is a prime learning opportunity, and the lesson is that we should do everything we can to prolong that window once weak AGI is invented.
Also, perhaps someone should study the necessary components for an AGI takeover by simulating agent behavior in a toy model. At the least you need a degree of agency, probably a self-model in order to recursively self-improve, and the ability to generalize. Knowing what the necessary components are might enable us to take steps to avoid having them in one system all at once.
If anyone has ever demonstrated, or even systematically described, what those necessary components are, I haven't seen it done. Maybe it is an infohazard but it also seems like necessary information to coordinate around.
You mentioned in the pre-print that results were "similar" for the two color temperatures, and referred to the Appendix for more information, but it seems like the Appendix isn't included in your pre-print. Are you able to elaborate on how similar results in these two conditions were? In my own personal exploration of this area I have put a lot of emphasis on color temperature. Your study makes me adjust down the importance of color temperature, although it would be good to get more information.
A consolidated list of bad or incomplete solutions could have considerable didactic value--it could help people learn more about the various challenges involved.
Not sure what I was thinking about, but probably just that my understanding is that "safe AGI via AUP" would have to penalize the agent for learning to achieve anything not directly related to the end goal, and that might make it too difficult to actually achieve the end goal when e.g. it turns out to need tangentially related behavior.
Your "social dynamics" section encouraged me to be bolder sharing my own ideas on this forum, and I wrote up some stuff today that I'll post soon, so thank you for that!
That was an inspiring and enjoyable read!
Can you say why you think AUP is "pointless" for Alignment? It seems to me attaining cautious behavior out of a reward learner might turn out to be helpful. Overall my intuition is it could turn out to be an essential piece of the puzzle.
I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.
I would very much like to see your dataset, as a zotero database or some other format, in order to better orient myself to the space. Are you able to make this available somehow?
Very very helpful! The clustering is obviously a function of the corpus. From your narrative, it seems like you only added the missing arXiv files after clustering. Is it possible the clusters would look different with those in?
One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions about where it is permitted to continue. I have proposed a kind of "decision paralysis" where, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I'm not the first or only one to describe this kind of conservatism, but I don't recall the framing others have used). In this case, the multiple objectives might be the primary objective and then your low-impact objective.
This might be a way forward to deal with your "High-Impact Interference" problem. Perhaps preventing an agent from engaging in high-impact interference is a necessary part of safe AI. When fulfillment of the primary objective seems to require engaging in high-impact interference, a safe AI might report to a human supervisor that it cannot proceed because of a particular side effect. The human supervisor could then decide whether the system should proceed or not. If the human supervisor judges that the system should proceed, they can re-specify the objective to permit the potential side effect, by specifying it as part of the primary objective itself.
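A minimal sketch of that escalation logic (hypothetical interface and an arbitrary impact limit, just to make the idea concrete):

```python
IMPACT_LIMIT = 1.0  # assumed tolerance on the low-impact objective

def choose_action(plans, primary_score, impact_score, report_to_supervisor):
    best = max(plans, key=primary_score)                # best plan for the primary objective
    if impact_score(best) > IMPACT_LIMIT:
        report_to_supervisor(best, impact_score(best))  # explain why it won't proceed
        return None                                     # decision paralysis: do nothing
    return best
```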
It seems like even amongst proponents of a "fast takeoff", we will probably have a few months of time between when we've built a superintelligence that appears to have unaligned values and when it is too late to stop it.
At that point, isn't stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?
That almost implies a simple plan for preparation: for every AGI built, researchers agree together to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it would remain dormant until its operators indicate it should act. It would have an instrumental goal of protecting users' ability to come to it and request that the first one be shut down.
Thanks for your thorough response. It is well-argued and as a result, I take back what I said. I'm not entirely convinced by your response but I will say I now have no idea! Being low-information on this, though, perhaps my reaction to the "challenge trial" idea mirrors other low-information responses, which is going to be most of them, so I'll persist in explaining my thinking mainly in the hope it'll help you and other pro-challenge people argue your case to others.
I'll start with maybe my biggest worry about a challenge trial: the idea that you could have a disease with an in-the-wild CFR of ~1%, that you could put 500 people through a challenge trial, and "very likely" none of them would die. With a CFR of 1%, expected fatalities among 500 people is 5. If medical observation and all the other precautions applied during a challenge trial reduce the CFR by a factor of 10, to 0.1%, your expected deaths is only 0.5, but that still seems unacceptably high for one trial, to me. To get the joint probability of zero deaths across all 500 people above 95%, you need closer to a 0.01% CFR. Is it realistic to think all the precautions in a challenge trial can reduce CFR by a factor of 100, from 1% to 0.01%? I have no idea, perhaps you do, but I'd want to know before feeling personally comfortable with a challenge trial.
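For concreteness, the back-of-the-envelope calculation behind that 0.01% figure (my own arithmetic, assuming deaths are independent across participants):

$$(1 - p)^{500} \geq 0.95 \;\Rightarrow\; p \leq 1 - 0.95^{1/500} \approx 1.0 \times 10^{-4} \approx 0.01\%$$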
Regarding R values and monkeypox generally, my understanding of this topic doesn't go much beyond this post and the group of responses to it, so I'm pretty low-confidence on anything here. Thus, if you say the R is potentially quite high, I believe you.
I do have additional uncertainty about R. From public reports about the means of transmission that [say](https://www.who.int/emergencies/disease-outbreak-news/item/2022-DON385) things like
Monkeypox virus is transmitted from one person to another by close contact with lesions, body fluids, respiratory droplets and contaminated materials such as bedding.
I'd have to guess it's going to be less infectious than covid, which had an R around 5? On the other hand, since the OP asked the question, there's more speculation about chains of transmission that seem to indicate a higher R. I acknowledge "lower than 5" is a guess with high error bars!
Having said that, to my mind, I now feel very conflicted. Having read AllAmericanBreakfast's comment and their headline, I felt reassured that monkeypox wasn't much for the public to be worried about, and the CDC and WHO would figure it out. But on my own understanding, if R is high (as you say) and CFR is anywhere much above 0.1%, and there's a widespread outbreak, that is pretty scary and we should all be much more on the alert than we already are?
And that would affirm your conclusion that challenge trials would be a good idea, as long as we have confidence the risk to participants is low.