Ah, gotcha. Yes, that seems reasonable.
I don't really understand why this is at all important. Do you expect (or endorse) users to... vote on posts solely by reading a list of titles, without clicking on them to refresh their memories of what the posts are about (and, as a natural corollary of this, seeing who the authors of the posts are)? What's the purpose of introducing inconveniences and hiding information when this information will very likely be found anyway?
I get the importance of marginalist thinking, and of pondering what incentives you are creating for the median and/or marginal voting participant, blah blah blah, but if there is ever a spot on the Internet where superficiality is at its lowest and the focus is on the essence above the form, the LW review process might well be it.
In light of that, this question just doesn't seem (to a rather outside observer like me) worth pondering all that much.
The 3 most important paragraphs, extracted to save readers the trouble of clicking on a link:
The Anduril and OpenAI strategic partnership will focus on improving the nation’s counter-unmanned aircraft systems (CUAS) and their ability to detect, assess and respond to potentially lethal aerial threats in real-time.
[...]
The accelerating race between the United States and China to lead the world in advancing AI makes this a pivotal moment. If the United States cedes ground, we risk losing the technological edge that has underpinned our national security for decades.
[...]
These models, which will be trained on Anduril’s industry-leading library of data on CUAS threats and operations, will help protect U.S. and allied military personnel and ensure mission success.
I appreciate your response, and I understand that you are not arguing in favor of this perspective. Nevertheless, since you have posited it, I have decided to respond to it myself and expand upon why I ultimately disagree with it (or at the very least, why I remain uncomfortable with it because it doesn't seem to resolve my confusions).
I think revealed preferences show I am a huge fan of explanations of confusing questions that claim the concepts we are reifying are ultimately inconsistent/incoherent, and that, instead of hitting our heads against the wall over and over, we should take a step back and ponder the topic at a more fundamental level first. So I am certainly open to the idea that “do I nonetheless continue living (in the sense of, say, anticipating the same kind of experiences)?” is a confused question.
But, as I see it, there are a ton of problems with applying this general approach in this particular case. First of all, if anticipated experiences are an ultimately incoherent concept that we cannot analyze without first (unjustifiably) reifying a theory-laden framework, how precisely are we to proceed from an epistemological perspective? When the foundation of 'truth' (or at least, what I conceive of it to be) is based around comparing and contrasting what we expect to see with what we actually observe experimentally, doesn't the entire edifice collapse once the essential constituent piece of 'experiences' breaks down? Recall the classic (and eternally underappreciated) paragraph from Eliezer:
I pause. “Well . . .” I say slowly. “Frankly, I’m not entirely sure myself where this ‘reality’ business comes from. I can’t create my own reality in the lab, so I must not understand it yet. But occasionally I believe strongly that something is going to happen, and then something else happens instead. I need a name for whatever-it-is that determines my experimental results, so I call it ‘reality’. This ‘reality’ is somehow separate from even my very best hypotheses. Even when I have a simple hypothesis, strongly supported by all the evidence I know, sometimes I’m still surprised. So I need different names for the thingies that determine my predictions and the thingy that determines my experimental results. I call the former thingies ‘belief,’ and the latter thingy ‘reality.’ ”
What exactly do we do once we give up on precisely pinpointing the phrases "I believe", "my [...] hypotheses", "surprised", "my predictions", etc.? Nihilism, attractive as it may be to some from a philosophical or 'contrarian coolness' perspective, is not decision-theoretically useful when you have problems to deal with and tasks to accomplish. Note that while Eliezer himself is not what he considers a logical positivist, I think I... might be?
I really don't understand what "best explanation", "true", or "exist" mean, as stand-alone words divorced from predictions about observations we might ultimately make about them.
This isn't just a semantic point, I think. If there are no observations we can make that ultimately reflect whether something exists in this (seems to me to be) free-floating sense, I don't understand what it can mean to have evidence for or against such a proposition. So I don't understand how I am even supposed to ever justifiably change my mind on this topic, even if I were to accept it as something worth discussing on the object-level.
Everything I believe, my whole theory of epistemology and everything else logically downstream of it (aka, virtually everything I believe), relies on the thesis (axiom, if you will) that there is a 'me' out there doing some sort of 'prediction + observation + updating' in response to stimuli from the outside world. I get that this might be like reifying ghosts in a Wentworthian sense when you drill down on it, but I still have desires about the world, dammit, even if they don't make coherent sense as concepts! And I want them to be fulfilled regardless.
And, moreover, one of those preferences is maintaining a coherent flow of existence, avoiding changes that would be tantamount to death (even if they are not as literal as 'someone blows my brains out'). As a human being, I have preferences over what I experience too, not just over what state the random excitations of quantum fields in the Universe are at some point past my expiration date. As far as I see, the hard problem of consciousness (i.e., the nature of qualia) has not been close to solved; any answer to it would have to give me a practical handbook for answering the initial questions I posed to jbash.
Edit: This comment misinterpreted the intended meaning of the post.
Practical CF, more explicitly: A simulation of a human brain on a classical computer, capturing the dynamics of the brain on some coarse-grained level of abstraction, that can run on a computer small and light enough to fit on the surface of Earth, with the simulation running at the same speed as base reality, would cause the same conscious experience as that brain, in the specific sense of thinking literally the exact same sequence of thoughts in the exact same order, in perpetuity.
I... don't think this is necessarily what @EuanMcLean meant? At the risk of conflating his own perspective and ambivalence on this issue with my own, this is a question of personal identity and whether the computationalist perspective, generally considered a "reasonable enough" assumption to almost never be argued for explicitly on LW, is correct. As I wrote a while ago on Rob's post:
As TAG has written a number of times, the computationalist thesis seems not to have been convincingly (or even concretely) argued for in any LessWrong post or sequence (including Eliezer's Sequences). What has been argued for, over and over again, is physicalism, and then more and more rejections of dualist conceptions of souls.
That's perfectly fine, but "souls don't exist and thus consciousness and identity must function on top of a physical substrate" is very different from "the identity of a being is given by the abstract classical computation performed by a particular (and reified) subset of the brain's electronic circuit," and the latter has never been given compelling explanations or evidence. This is despite the fact that the particular conclusions that have become part of the ethos of LW about stuff like brain emulation, cryonics, etc., are necessarily reliant on the latter, not the former.
As a general matter, accepting physicalism as correct would naturally lead one to the conclusion that what runs on top of the physical substrate works on the basis of... what is physically there (which, to the best of our current understanding, can be represented through Quantum Mechanical probability amplitudes), not what conclusions you draw from a mathematical model that abstracts away quantum randomness in favor of a classical picture, the entire brain structure in favor of (a slightly augmented version of) its connectome, and the entire chemical make-up of it in favor of its electrical connections. As I have mentioned, that is a mere model that represents a very lossy compression of what is going on; it is not the same as the real thing, and conflating the two is an error that has been going on here for far too long. Of course, it very well might be the case that Rob and the computationalists are right about these issues, but the explanation up to now should make it clear why it is on them to provide evidence for their conclusion.
I recognize you wrote in response to me a while ago that you "find these kinds of conversations to be very time-consuming and often not go anywhere." I understand this, and I sympathize to a large extent: I also find these discussions very tiresome, which became part of why I ultimately did not engage too much with some of the thought-provoking responses to the question I posed a few months back. So it's totally ok for us not to get into the weeds of this now (or at any point, really). Nevertheless, for the sake of it, I think the "everyday experience" thermostat example does not seem like an argument in favor of computationalism over physicalism-without-computationalism, since the primary generator of my intuition that my identity would be the same in that case is the literal physical continuity of my body throughout that process. I just don't think there is a "prosaic" (i.e., bodily-continuity-preserving) analogue or intuition pump to the case of WBE or similar stuff in this respect.
Anyway, in light of footnote 10 in the post ("The question of whether such a simulation contains consciousness at all, of any kind, is a broader discussion that pertains to a weaker version of CF that I will address later on in this sequence"), which to me draws an important distinction between a brain-simulation having some consciousness/identity versus having the same consciousness/identity as that of whatever (physically-instantiated) brain it draws from, I did want to say that this particular post seems focused on the latter and not the former, which seems quite decision-relevant to me:
jbash: These various ideas about identity don't seem to me to be things you can "prove" or "argue for". They're mostly just definitions that you adopt or don't adopt. Arguing about them is kind of pointless.
sunwillrise: I absolutely disagree. The basic question of "if I die but my brain gets scanned beforehand and emulated, do I nonetheless continue living (in the sense of, say, anticipating the same kinds of experiences)?" seems the complete opposite of pointless, and the kind of conundrum in which agreeing or disagreeing with computationalism leads to completely different answers.
Perhaps there is a meaningful linguistic/semantic component to this, but in the example above, it seems understanding the nature of identity is decision-theoretically relevant for how one should think about whether WBE would be good or bad (in this particular respect, at least).
All of these ideas sound awesome and exciting, and precisely the right kind of use of LLMs that I would like to see on LW!
It's looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky thought, and related to this, values are less fragile than people thought 15-20 years ago, in the sense that values generalize far better OOD than people used to think 15-20 years ago
I'm not sure I like this argument very much, as it currently stands. It's not that I believe anything you wrote in this paragraph is wrong per se, but more like this misses the mark a bit in terms of framing.
Yudkowsky had (and, AFAICT, still has) a specific theory of human values in terms of what they mean in a reductionist framework, where it makes sense (and is rather natural) to think of (approximate) utility functions of humans and of Coherent Extrapolated Volition as things-that-exist-in-the-territory.
I think a lot of writing and analysis, summarized by me here, has cast a tremendous amount of doubt on the viability of this way of thinking and has revealed what seem to me to be impossible-to-patch holes at the core of these theories. I do not believe "human values" in the Yudkowskian sense ultimately make sense as a coherent concept that carves reality at the joints; I instead observe a tremendous number of unanswered questions and apparent contradictions that throw the entire edifice in disarray.
But supplementing this reorientation of thinking around what it means to satisfy human values, "prosaic" alignment researchers have pivoted more towards intent alignment, as opposed to doomed-from-the-start paradigms like "learning the true human utility function" or ambitious value learning. There has also been a recognition that realism about (AGI) rationality is likely just straight-up false and that the very specific set of conclusions MIRI-clustered alignment researchers have reached about what AGI cognition will be like is entirely overconfident and seems contradicted by our modern observations of LLMs, as well as an increased focus on the basic observation that full value alignment simply is not required for a good AI outcome (or at the very least to prevent AI takeover). So it's not so much that human values (to the extent such a thing makes sense) are simpler, but more that fulfilling those values is just not needed to nearly as high a degree as people used to think.
Mainly, minecraft isn't actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Is this inherently bad? Many of the tasks that will be given to LLMs (or scaffolded versions of them) in the future will involve, at least to some extent, decision-making and processes whose analogues appear somewhere in their training data.
It still seems tremendously useful to see how they would perform in such a situation. At worst, it provides information about a possible upper bound on the alignment of these agentized versions: yes, maybe you're right that you can't say they will perform well in out-of-distribution contexts if all you see are benchmarks and performances on in-distribution tasks; but if they show gross misalignment on tasks that are in-distribution, then this suggests they would likely do even worse when novel problems are presented to them.
a lot of skill ceilings are much higher than you might think, and worth investing in
The former doesn't necessarily imply the latter in general, because even if we are systematically underestimating the realistic upper bound for our skill level in these areas, we would still have to deal with diminishing marginal returns to investing in any particular one. As a result, I am much more confident of the former claim being correct for the average LW reader than of the latter. In practice, my experience tells me that you often have "phase changes" of sorts, where there's a rather binary instead of continuous response to a skill level increase: either you've hit the activation energy level, and thus unlock the self-reinforcing loop of benefits that flow from the skill (once you can apply it properly and iterate on it or use it recursively), or you haven't, in which case any measurable improvement is minimal. It's thus often more important to get past the critical point than to make marginal improvements either before or after hitting it.
On the other hand, many of the skills you mentioned afterwards in your comment seem relatively general-purpose, so I could totally be off-base in these specific cases.
The document seems to try to argue that Uber cannot possibly become profitable. I would be happy to take a bet that Uber will become profitable within the next 5 years.
This is an otherwise valuable discussion that I'd rather not have on LW, for the standard reasons; it seems a bit too close to the partisan side of the policy/partisanship political discussion divide. I recognize I wrote a comment in reaction to yours (shame on me), and so you were fully within your rights to respond, but I'd rather stop it here.
But rather that if he doesn't do it, it will be because he doesn't want to, not because his constituents don't.
I generally prefer not to dive into the details of partisan politics on LW, but my reading of the comment you are responding to makes me believe that, by "Republicans under his watch", ChristianKl is referring to Republican politicians/executive appointees and not to Republican voters.
I am not saying I agree with this perspective, just that it seems to make a bit more sense to me in context. The idea would be that Trump has been able to use "leadership" to remake the Republican party in his image and get the party elites to support him only because he has mostly governed as a standard conservative Republican on economic issues (tax cuts for rich people & corporations, attempts to repeal the ACA, deregulation, etc); the symbiotic relationship they enjoy would therefore supposedly have as a prerequisite the idea that Trump would not try to enforce idiosyncratic views on other Republicans too much...
Tabooing words is bad if, by tabooing, you are denying your interlocutors the ability to accurately express the concepts in their minds.
We can split situations where miscommunication about the meaning of words persists despite repeated attempts by all sides to resolve it into three broad categories. On the one hand, you have those that come about because of (explicit or implicit) definitional disputes, such as the famous debate (mentioned in the Sequences) over whether trees that fall make sounds if nobody is around to hear them. Different people might give different responses (literally 'yes' vs 'no' in this case), but this is simply because they interpret the words involved differently. When you replace the symbol with the substance, you realize that there is no empirical difference in what anticipated experiences the two sides have, and thus the entire debate is revealed to be a waste of time. By dissolving the question, you have resolved it.
That does not capture the entire cluster of persistent semantic disagreements, however, because there is a second possible upstream generator of controversy, namely the fact that, oftentimes, the concept itself is confused. This often comes about because one side (or both) reifies or gives undue consideration to a mental construct that does not correspond to reality; perhaps the concept of 'justice' is an example of this, or the notion of observer-independent morality (if you subscribe to either moral anti-realism or to the non-mainstream Yudkowskian conception of realism). In this case, it is generally worthwhile to spend the time necessary to bring everyone on the same page that the concept itself should be abandoned and we should avoid trying to make sense of reality through frameworks that include it.
But, sometimes, we talk about confusing concepts not because the concepts themselves ultimately do not make sense in the territory (as opposed to our fallible maps), but because we simply lack the gears-level understanding required to make sense of our first-person, sensory experiences. All we can do is bumble around, trying to gesture at what we are confused about (like consciousness, qualia, etc), without the ability to pin it down with surgical precision. Not because our language is inadequate, not[1] because the concept we are homing in on is inherently nonsensical, but because we are like cavemen trying to reason about the nature of the stars in the sky. The stars are not an illusion, but our hypotheses about them ('they are Gods' or 'they are fallen warriors' etc) are completely incompatible with reality. Not due to language barriers or anything like that, but because we lack the large foundation and body of knowledge needed to even orient ourselves properly around them.
To me, consciousness falls into the third category. If you taboo too many of the most natural, intuitive ways of talking about it, you are not benefitting from a more careful and precise discussion of the concepts involved. On the contrary, you are instead forcing people who lack the necessary subject-matter knowledge (i.e., arguably all of us) to make up their own hypotheses about how it functions. Of course they will come to different conclusions; after all, the hard problem of consciousness is still far from being settled!
- ^
At least not necessarily because of this; you can certainly take an illusionistic perspective on the nature of consciousness.
I agree, but I think this is slightly beside the original points I wanted to make.
I continue to strongly believe that your previous post is methodologically dubious and does not provide an adequate set of explanations of what "humans believe in" when they say "consciousness." I think the results that you obtained from your surveys are ~ entirely noise generated by forcing people who lack the expertise necessary to have a gears-level model of consciousness (i.e., literally all people in existence now or in the past) to talk about consciousness as though they did, by denying them the ability to express themselves using the language that represents their intuitions best.
Normally, I wouldn't harp on that too much here given the passage of time (water under the bridge and all that), but literally this entire post is based on a framework I believe gets things totally backwards. Moreover, I was very (negatively) surprised to see respected users on this site apparently believing your previous post was "outstanding" and "very legible evidence" in favor of your thesis.
I dearly hope this general structure does not become part of the LW zeitgeist for thinking about an issue as important as this.
arguments about risks from destabilization/power dynamics and potential conflicts between various actors are probably both more legible and 'truer'
Say more?
I think the general impression of people on LW is that multipolar scenarios and concerns over "which monkey finds the radioactive banana and drags it home" are in large part a driver of AI racing instead of being a potential impediment/solution to it. Individuals, companies, and nation-states justifiably believe that whichever one of them accesses potentially superhuman AGI first will have the capacity to flip the gameboard at-will, obtain power over the entire rest of the Earth, and destabilize the currently-existing system. Standard game theory explains the final inferential step for how this leads to full-on racing (see the recent U.S.-China Commission's report for a representative example of how this plays out in practice).
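(To spell out that last inferential step with a toy model: below is a purely illustrative Python sketch, with payoff numbers I made up rather than anything drawn from the Commission's report, showing why "race" ends up as the unique equilibrium once each actor treats "I pause while the other races" as the worst possible outcome.)

```python
# Toy two-actor "race to AGI" game; the payoff numbers are invented for illustration.
# Higher is better. (pause, pause) is collectively best, but each actor fears being
# the one who pauses while the other races to a decisive strategic advantage.
payoffs = {
    # (A's move, B's move): (A's payoff, B's payoff)
    ("pause", "pause"): (3, 3),
    ("pause", "race"):  (0, 4),
    ("race",  "pause"): (4, 0),
    ("race",  "race"):  (1, 1),
}
moves = ["pause", "race"]

def best_responses(player: int, other_move: str) -> set:
    """Moves maximizing this player's payoff, holding the other player's move fixed."""
    def payoff(move):
        profile = (move, other_move) if player == 0 else (other_move, move)
        return payoffs[profile][player]
    best = max(payoff(m) for m in moves)
    return {m for m in moves if payoff(m) == best}

# A profile is a pure-strategy Nash equilibrium iff each move is a best response.
equilibria = [(a, b) for a in moves for b in moves
              if a in best_responses(0, b) and b in best_responses(1, a)]
print(equilibria)  # [('race', 'race')] -- full-on racing is the unique equilibrium
```

The specific numbers don't matter, only the ordering: as long as each actor ranks "I race while the other pauses" best and "I pause while the other races" worst, racing strictly dominates pausing, no matter how much everyone would prefer coordinated restraint.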
I get that we'd like to all recognize this problem and coordinate globally on finding solutions, by "mak[ing] coordinated steps away from Nash equilibria in lockstep". But I would first need to see an example, a prototype, of how this can play out in practice on an important and highly salient issue. Stuff like the Montreal Protocol banning CFCs doesn't count because the ban only happened once comparably profitable/efficient alternatives had already been designed; totally disanalogous to the spot we are in right now, where AGI will likely be incredibly economically profitable, perhaps orders of magnitude more so than the second-best alternative.
This is in large part why Eliezer often used to challenge readers and community members to ban gain-of-function research, as a trial run of sorts for how global coordination on pausing/slowing AI might go.
Hmm, that sounds about right based on the usual human-vs-human transfer from Elo difference to performance... but I am still not sure if that holds up when you have odds games, which feel qualitatively different to me than regular games. Based on my current chess intuition, I would expect the ability to win odds games to scale better near the top level than the Elo model would suggest, but I could be wrong about this.
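(For reference, the "usual transfer" I have in mind is just the standard logistic Elo model; a minimal sketch below. The whole point of my caveat is that this mapping is calibrated on normal human-vs-human games and may simply not carry over to odds play.)

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard logistic Elo model: expected score of A against B (win=1, draw=0.5, loss=0)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 400-point rating gap corresponds to roughly a 91% expected score in normal games,
# which is the human-vs-human baseline I am gesturing at above.
print(round(elo_expected_score(2800, 2400), 2))  # ~0.91
```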
rapid time controls
I am very skeptical of this on priors, for the record. I think this statement could be true for superblitz time controls and whatnot, but I would be shocked if knight odds would be enough to beat Magnus in a 10+0 or 15+0 game. That being said, I have no inside knowledge, and I would update a lot of my beliefs significantly if your statement as currently written actually ends up being true.
conditional on you aligning with the political priorities of OP funders or OP reputational management
Do you mean something more expansive than "literally don't pursue projects that are either conservative/Republican-coded or explicitly involved in expanding/enriching the Rationality community"? Which, to be clear, would be less-than-ideal if true, but should be talked about in more specific terms when giving advice to potential grant-receivers.
I get an overall vibe from many of the comments you've made recently about OP, both here and on the EA forum, that you believe in a rather broad sense they are acting to maximize their own reputation or whatever Dustin's whims are that day (and, consequently, lying/obfuscating this in their public communications to spin these decisions the opposite way), but I don't think[1] you have mentioned any specific details that go beyond their own dealings with Lightcone and with right-coded figures.
- ^
Could be a failure of my memory, ofc
One problem is that, at least in my experience, drunk people are much less fun to be around when you yourself are not also drunk. As a non-drinker who sometimes casually hangs out with drinkers, the drinking gap between us often makes the connection less potent.
rather than
You mean "in addition to", right? Knowing what the AI alone is capable of doing is quite an important part of what evals are about, so keeping it there seems crucial.
[Coming at this a few months late, sorry. This comment by @Steven Byrnes sparked my interest in this topic once again]
I don't see why we care if evolution is a good analogy for alignment risk. The arguments for misgeneralization/mis-specification stand on their own. They do not show that alignment is impossible, but they do strongly suggest that it is not trivial.
Focusing on this argument seems like missing the forest for the trees.
Ngl, I find everything you've written here a bit... baffling, Seth. Your writing in particular and your exposition of your thoughts on AI risk generally does not use evolutionary analogies, but this only means that posts and comments criticizing analogies with evolution (sample: 1, 2, 3, 4, 5, etc) are just not aimed at you and your reasoning. I greatly enjoy reading your writing and pondering the insights you bring up, but you are simply not even close to the most publicly-salient proponent of "somewhat high p(doom)" among the AI alignment community. It makes perfect sense from the perspective of those who disagree with you (or other, more hardcore "doomers") on the bottom-line question of AI risk to focus their public discourse primarily on responding to the arguments brought up by the subset of "doomers" who are most salient and also most extreme in their views, namely the MIRI-cluster centered around Eliezer, Nate Soares, and Rob Bensinger.
And when you turn to MIRI and the views that its members have espoused on these topics, I am very surprised to hear that "The arguments for misgeneralization/mis-specification stand on their own" and are not ultimately based on analogies with evolution.
But anyway, to hopefully settle this once and for all, let's go through all the examples that pop up in my head immediately when I think of this, shall we?
From the section on inner & outer alignment of "AGI Ruin: A List of Lethalities", by Yudkowsky (I have removed the original emphasis and added my own):
15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. Given otherwise insufficient foresight by the operators, I'd expect a lot of those problems to appear approximately simultaneously after a sharp capability gain. See, again, the case of human intelligence. We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously. (People will perhaps rationalize reasons why this abstract description doesn't carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”. My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned 'lethally' dangerous relative to the outer optimization loop of natural selection. Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
[...]
21. There's something like a single answer, or a single bucket of answers, for questions like 'What's the environment really like?' and 'How do I figure out the environment?' and 'Which of my possible outputs interact with reality in a way that causes reality to have certain properties?', where a simple outer optimization loop will straightforwardly shove optimizees into this bucket. When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases. This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of 'relative inclusive reproductive fitness' - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else. This abstract dynamic is something you'd expect to be true about outer optimization loops on the order of both 'natural selection' and 'gradient descent'. The central result: Capabilities generalize further than alignment once capabilities start to generalize far.
From "A central AI alignment problem: capabilities generalization, and the sharp left turn", by Nate Soares, which, by the way, quite literally uses the exact phrase "The central analogy"; as before, emphasis is mine:
My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.
And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.
From "The basic reasons I expect AGI ruin", by Rob Bensinger:
When I say "general intelligence", I'm usually thinking about "whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems".
[...]
Human brains aren't perfectly general, and not all narrow AI systems or animals are equally narrow. (E.g., AlphaZero is more general than AlphaGo.) But it sure is interesting that humans evolved cognitive abilities that unlock all of these sciences at once, with zero evolutionary fine-tuning of the brain aimed at equipping us for any of those sciences. Evolution just stumbled into a solution to other problems, that happened to generalize to millions of wildly novel tasks.
[...]
Human brains underwent no direct optimization for STEM ability in our ancestral environment, beyond traits like "I can distinguish four objects in my visual field from five objects".[5]
[5] More generally, the sciences (and many other aspects of human life, like written language) are a very recent development on evolutionary timescales. So evolution has had very little time to refine and improve on our reasoning ability in many of the ways that matter
From "Niceness is unnatural", by Nate Soares:
I think this view is wrong, and I don't see much hope here. Here's a variety of propositions I believe that I think sharply contradict this view:
- There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.
- The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly "nice".
- Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI's mind to work differently from how you hope it works.
- The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.
From "Superintelligent AI is necessary for an amazing future, but far from sufficient", by Nate Soares:
These are the sorts of features of human evolutionary history that resulted in us caring (at least upon reflection) about a much more diverse range of minds than “my family”, “my coalitional allies”, or even “minds I could potentially trade with” or “minds that share roughly the same values and faculties as me”.
Humans today don’t treat a family member the same as a stranger, or a sufficiently-early-development human the same as a cephalopod; but our circle of concern is certainly vastly wider than it could have been, and it has widened further as we’ve grown in power and knowledge.
From the Eliezer-edited summary of "Ngo and Yudkowsky on alignment difficulty", by... Ngo and Yudkowsky:
Eliezer, summarized by Richard (continued): "In biological organisms, evolution is ~~one source~~ the ultimate source of consequentialism. A ~~second~~ secondary outcome of evolution is reinforcement learning. For an animal like a cat, upon catching a mouse (or failing to do so) many parts of its brain get slightly updated, in a loop that makes it more likely to catch the mouse next time. (Note, however, that this process isn’t powerful enough to make the cat a pure consequentialist - rather, it has many individual traits that, when we view them from this lens, point in the same direction.) ~~A third thing that makes humans in particular consequentialist is planning,~~ Another outcome of evolution, which helps make humans in particular more consequentialist, is planning - especially when we’re aware of concepts like utility functions."
From "Comments on Carlsmith's “Is power-seeking AI an existential risk?"", by Nate Soares:
Perhaps Joe thinks that alignment is so easy that it can be solved in a short time window?
My main guess, though, is that Joe is coming at things from a different angle altogether, and one that seems foreign to me.
Attempts to generate such angles along with my corresponding responses:
- Claim: perhaps it's just not that hard to train an AI system to be "good" in the human sense? Like, maybe it wouldn't have been that hard for natural selection to train humans to be fitness maximizers, if it had been watching for goal-divergence and constructing clever training environments?
- Counter: Maybe? But I expect these sorts of things to take time, and at least some mastery of the system's internals, and if you want them to be done so well that they actually work in practice even across the great Change-Of-Distribution to operating in the real world then you've got to do a whole lot of clever and probably time-intensive work.
- Claim: perhaps there's just a handful of relevant insights, and new ways of thinking about things, that render the problem easy?
- Counter: Seems like wishful thinking to me, though perhaps I could go point-by-point through hopeful-to-Joe-seeming candidates?
From "Soares, Tallinn, and Yudkowsky discuss AGI cognition", by... well, you get the point:
Eliezer: Something like, "Evolution constructed a jet engine by accident because it wasn't particularly trying for high-speed flying and ran across a sophisticated organism that could be repurposed to a jet engine with a few alterations; a human industry would be gaining economic benefits from speed, so it would build unsophisticated propeller planes before sophisticated jet engines." It probably sounds more convincing if you start out with a very high prior against rapid scaling / discontinuity, such that any explanation of how that could be true based on an unseen feature of the cognitive landscape which would have been unobserved one way or the other during human evolution, sounds more like it's explaining something that ought to be true.
And why didn't evolution build propeller planes? Well, there'd be economic benefit from them to human manufacturers, but no fitness benefit from them to organisms, I suppose? Or no intermediate path leading to there, only an intermediate path leading to the actual jet engines observed.
I actually buy a weak version of the propeller-plane thesis based on my inside-view cognitive guesses (without particular faith in them as sure things), eg, GPT-3 is a paper airplane right there, and it's clear enough why biology could not have accessed GPT-3. But even conditional on this being true, I do not have the further particular faith that you can use propeller planes to double world GDP in 4 years, on a planet already containing jet engines, whose economy is mainly bottlenecked by the likes of the FDA rather than by vaccine invention times, before the propeller airplanes get scaled to jet airplanes.
The part where the whole line of reasoning gets to end with "And so we get huge, institution-reshaping amounts of economic progress before AGI is allowed to kill us!" is one that doesn't feel particular attractored to me, and so I'm not constantly checking my reasoning at every point to make sure it ends up there, and so it doesn't end up there.
From "Humans aren't fitness maximizers", by Soares:
One claim that is hopefully uncontroversial (but that I'll expand upon below anyway) is:
- Humans are not literally optimizing for IGF, and regularly trade other values off against IGF.
Separately, we have a stronger and more controversial claim:
- If an AI's objectives included goodness in the same way that our values include IGF, then the future would not be particularly good.
I think there's more room for argument here, and will provide some arguments.
A semi-related third claim that seems to come up when I have discussed this in person is:
- Niceness is not particularly canonical; AIs will not by default give humanity any significant fraction of the universe in the spirit of cooperation.
I endorse that point as well. It takes us somewhat further afield, and I don't plan to argue it here, but I might argue it later.
From "Shah and Yudkowsky on alignment failures", by the usual suspects:
Yudkowsky: and lest anyone start thinking that was an exhaustive list of fundamental problems, note the absence of, for example, "applying lots of optimization using an outer loss function doesn't necessarily get you something with a faithful internal cognitive representation of that loss function" aka "natural selection applied a ton of optimization power to humans using a very strict very simple criterion of 'inclusive genetic fitness' and got out things with no explicit representation of or desire towards 'inclusive genetic fitness' because that's what happens when you hill-climb and take wins in the order a simple search process through cognitive engines encounters those wins"
From the comments on "Late 2021 MIRI Conversations: AMA / Discussion", by Yudkowsky:
Yudkowsky: I would "destroy the world" from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.
From the perspective of my highly similar fellow humans with whom I evolved in context, they'd get nice stuff, because "my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me, as the result of my being strictly outer-optimized over millions of generations for inclusive genetic fitness, which I now don't care about at all.
Paperclip-numbers do well out of paperclip-number maximization. The hapless outer creators of the thing that weirdly ends up a paperclip maximizer, not so much.
From Yudkowsky's appearance on the Bankless podcast (full transcript here):
Ice cream didn't exist in the natural environment, the ancestral environment, the environment of evolutionary adeptedness. There was nothing with that much sugar, salt, fat combined together as ice cream. We are not built to want ice cream. We were built to want strawberries, honey, a gazelle that you killed and cooked [...] We evolved to want those things, but then ice cream comes along, and it fits those taste buds better than anything that existed in the environment that we were optimized over.
[...]
Leaving that aside for a second, the reason why this metaphor breaks down is that although the humans are smarter than the chickens, we're not smarter than evolution, natural selection, cumulative optimization power over the last billion years and change. (You know, there's evolution before that but it's pretty slow, just, like, single-cell stuff.)
There are things that cows can do for us, that we cannot do for ourselves. In particular, make meat by eating grass. We’re smarter than the cows, but there's a thing that designed the cows; and we're faster than that thing, but we've been around for much less time. So we have not yet gotten to the point of redesigning the entire cow from scratch. And because of that, there's a purpose to keeping the cow around alive.
And humans, furthermore, being the kind of funny little creatures that we are — some people care about cows, some people care about chickens. They're trying to fight for the cows and chickens having a better life, given that they have to exist at all. And there's a long complicated story behind that. It's not simple, the way that humans ended up in that [??]. It has to do with the particular details of our evolutionary history, and unfortunately it's not just going to pop up out of nowhere.
But I'm drifting off topic here. The basic answer to the question "where does that analogy break down?" is that I expect the superintelligences to be able to do better than natural selection, not just better than the humans.
At this point, I'm tired, so I'm logging off. But I would bet a lot of money that I can find at least 3x the number of these examples if I had the energy to. As Alex Turner put it, it seems clear to me that, for a very high portion of "classic" alignment arguments about inner & outer alignment problems, at least in the form espoused by MIRI, the argumentative bedrock is ultimately based on little more than analogies with evolution.
We think it works like this
Who is "we"? Is it:
- only you and your team?
- the entire Apollo Research org?
- the majority of mechinterp researchers worldwide?
- some other group/category of people?
Also, this definitely deserves to be made into a high-level post, if you end up finding the time/energy/interest in making one.
randomly
As an aside (that's still rather relevant, IMO), it is a huge pet peeve of mine when people use the word "randomly" in technical or semi-technical contexts (like this one) to mean "uniformly at random" instead of just "according to some probability distribution." I think the former elevates and reifies a way-too-common confusion and draws attention away from the important upstream generator of disagreements, namely how exactly the constitution is sampled.
I wouldn't normally have said this, but given your obvious interest in math, it's worth pointing out that the answers to these questions you have raised naturally depend very heavily on what distribution we would be drawing from. If we are talking about, again, a uniform distribution from "the design space of minds-in-general" (so we are just summoning a "random" demon or shoggoth), then we might expect one answer. If, however, the search is inherently biased towards a particular submanifold of that space, because of the very nature of how these AIs are trained/fine-tuned/analyzed/etc., then you could expect a different answer.
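(To make the distinction concrete, here is a trivial sketch with completely made-up categories and probabilities; its only point is that "randomly" underdetermines what you should expect until you actually name the distribution being sampled from.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "design space" of three kinds of minds, purely for illustration.
minds = ["aligned-ish", "alien-but-harmless", "adversarial"]

# "Randomly" under a uniform distribution over the space:
uniform_draws = rng.choice(minds, size=10_000, p=[1/3, 1/3, 1/3])

# "Randomly" under a distribution heavily biased by how the systems are actually
# trained, fine-tuned, and selected (numbers invented for the sketch):
biased_draws = rng.choice(minds, size=10_000, p=[0.70, 0.25, 0.05])

for name, draws in [("uniform", uniform_draws), ("biased", biased_draws)]:
    frac = (draws == "adversarial").mean()
    print(f"{name}: fraction adversarial ~ {frac:.2f}")
# Same word "randomly", very different expectations about what you end up drawing.
```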
One of the advantages to remaining agnostic comes from the same argument that users put forth in the comment sections on this very site way back in the age of the Sequences (I can look up the specific links if people really want me to; they were in response to the Doublethink Sequence) for why it's not necessarily instrumentally rational for limited beings like humans to actually believe in the Litany of Tarski: if you are in a precarious social situation, in which retaining status/support/friends/resources is contingent on you successfully signaling to your in-group that you maintain faith in their core teachings, it simply doesn't suffice to say "acquire all the private truth through regular means and don't talk/signal publicly the stuff that would be most dangerous to you," because you don't get complete control over what you signal.
If you learn that the in-group is wrong about some critical matter, and you understand that in-group members realizing you no longer agree with them will result in harm to you (directly, or through your resources being cut off), your only option is to act (to some extent) deceptively. To take on the role, QuirrellMort-style, of somebody who does not have access to the information you have actually stumbled upon, and to pretend to be just another happy & clueless member of the community.
This is capital-H Hard. Lying (or even something smaller-scale like lesser deceptions), when done consistently and routinely, to people that you consider(ed) your family/friends/acquaintances, is very hard for (the vast majority of) people. For straightforward evolutionary reasons, we have evolved to be really good at detecting when one of our own is not being fully forthcoming. You can bypass this obstacle if the number of interactions you have is small, or if, as is usually the case in modern life when people get away with lies, nobody actually cares about the lie and it's all just a game of make-believe where you just have to "utter the magic words." But when it's not a game, when people do care about honestly signaling your continued adherence to the group's beliefs and epistemology, you're in big trouble.
Indeed, by far the most efficient way of convincing others of your bullshit on a regular basis is to convince yourself first, and by putting yourself in a position where you must do the former, you are increasing the likelihood of the latter with every passing day. Quite the opposite of what you'd like to see happen, if you care about truth-seeking to any large extent.
(addendum: admittedly, this doesn't answer the question fully, since it doesn't deal with the critical distinction between agnosticism and explicit advocacy, but I think it does get at something reasonably important in the vicinity of it anyway)
There's an alignment-related problem, the problem of defining real objects. Relevant topics: environmental goals; task identification problem; "look where I'm pointing, not at my finger"; Eliciting Latent Knowledge.
Another highly relevant post: The Pointers Problem.
So, where are the Knuths of the modern era? Why is modern AI dominated by the Lorem Epsoms of the world? Where is the craftsmanship? Why are our AI tools optimized for seeming good, rather than being good?
[2] Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [king] vector, subtract the [male] vector, add the [female] vector, and get out something close to the [queen] vector? That was cool! Where's the stuff like that these days?
I'm a bit confused by your confusion, and by the fact that your post does not contain what seems to me like the most straightforward explanation of these phenomena. An explanation that I am almost fully certain you are aware of, and which seems to be almost universally agreed upon by those interested (at any level) in interpretability in ML.
Namely the fact that, starting in the 2010s, it happened to be the case (for a ton of historically contingent reasons) that top AI companies (at the beginning, and followed by other ML hubs and researchers afterwards) realized the bitter lesson is basically correct: attempts to hard-code human knowledge or intuition into frontier models ultimately always harm their performance in the long-term compared to "literally just scale the model with more data and compute." This led to a focus, among experts and top engineers, on figuring out scaling laws, ways of improving the quality and availability of data (perhaps through synthetic generation methods), ways of creating better end-user products through stuff like fine-tuning and RLHF, etc, instead of the older GOFAI stuff of trying to figure out at a deeper level what is going on inside the model.
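(For concreteness, "scaling laws" here means empirical fits of the kind reported in Kaplan et al. (2020), roughly of the power-law form $L(N) \approx (N_c/N)^{\alpha_N}$ for loss as a function of non-embedding parameter count, with fitted constants on the order of $\alpha_N \approx 0.076$ and $N_c \approx 8.8 \times 10^{13}$; I am quoting those from memory, so treat the exact values as approximate. The practical upshot is that predictable capability gains could be bought with more compute and data alone, without any new insight into the model's internals.)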
Another way of saying this is that top researchers and companies ultimately stumbled on an AI paradigm which increased capabilities significantly more than had been achievable previously, but at the cost of strongly decoupling "capability improvements" and "interpretability improvements" as distinct things that researchers and engineers could focus on. It's not that capability and interpretability were necessarily tightly correlated in the past; that is not the claim I am making. Rather, I am saying that in the pre-(transformer + RL) era, the way you generated improvements in your models/AI was by figuring out specific issues and analyzing them deeply to find out how to get around them, whereas now, a far simpler, easier, less insight-intensive approach became available: literally just scaling up the model with more data and compute.
So the basic point is that you no longer see all this cool research on the internal representations that models generate of high-dimensional data like word embeddings (such as the word2vec stuff you are referring to in the second footnote) because you no longer have nearly as much of a need for these insights in order to improve the capabilities/performance of the AI tools currently in use. It's fundamentally an issue with demand, not with supply. And the demand from the interpretability-focused AI alignment community is just nowhere close to large enough to bridge the gap and cover the loss generated by the shift in paradigm focus and priorities among the capabilities/"normie" AI research community.
Indeed, the notion that nowadays, the reason you no longer have deep thinkers who try to figure out what is going on or are "motivated by reasons" in how they approach these issues, is somehow because "careful thinkers read LessWrong and decided against contributing to AI progress," seems... rather ridiculous to me? It's not like I enjoy responding to an important question that you are asking with derision in lieu of a substantive response, but... I mean, the literal authors of the word2vec paper you cited were AI (capabilities) researchers working at top companies, not AI alignment researchers! Sure, some people like Bengio and Hofstadter (less relevant in practical terms) who are obviously not "LARP-ing impostors" in Wentworth's terminology have made the shift from capabilities work to trying to raise public awareness of alignment/safety/control problems. But the vast majority (according to personal experience, general impressions, as well as the current state of the discourse on these topics) absolutely have not, and since they were the ones generating the clever insights back in the day, of course it makes sense that the overall supply of these insights has gone down.
I just really don't see how it could be the case that "people refuse to generate these insights because they have been convinced by AI safety advocates that it would dangerously increase capabilities and shorten timelines" and "people no longer generate these insights as much because they are instead focusing on other tasks that improve model capabilities more rapidly and robustly, given the shifted paradigm" are two hypotheses that can be given similar probabilities in any reasonable person's mind. The latter should be at least a few orders of magnitude more likely than the former, as I see it.
some people say that "winning is about not playing dominated strategies"
I do not believe this statement. As in, I do not currently know of a single person, associated either with LW or with decision-theory academia, that says "not playing dominated strategies is entirely action-guiding." So, as Raemon pointed out, "this post seems like it’s arguing with someone but I’m not sure who."
In general, I tend to mildly disapprove of phrases like "a widely-used strategy" or "we often encounter claims" without any direct citations of the individuals who are purportedly making these mistakes. If it really were that widely used, surely it would be trivial for the authors to quote a few examples off the top of their head, no? What does it say about them that they didn't?
I think it's not quite as clear as needing to shut down all other AGI projects or we're doomed; a small number of AGIs under control of different humans might be stable with good communication and agreements, at least until someone malevolent or foolish enough gets involved.
Realistically, in order to have a reasonable degree of certainty that this state can be maintained for more than a trivial amount of time, this would, at the very least, require a hard ban on open-source AI, as well as international agreements to strictly enforce transparency and compute restrictions, with the direct use of force if need be, especially if governments get much more involved in AI in the near-term future (which I expect will happen).
Do you agree with this, as a baseline?
Does this plan necessarily factor through using the intent-aligned AGI to quickly commit some sort of pivotal act that flips the gameboard and prevents other intent-aligned AGIs from being used malevolently by self-interested or destructive (human) actors to gain a decisive strategic advantage? After all, it sure seems less than ideal to find yourself in a position where you can solve the theoretical parts of value alignment,[1] but you cannot implement that solution in practice because control over the entire future light cone has already been permanently taken over by an AGI intent-aligned to someone who does not care about any of your broadly prosocial goals...
- ^
In so far as something like this even makes sense, which I have already expressed my skepticism of many times, but I don't think I particularly want to rehash this discussion with you right now...
You've gotten a fair number of disagree-votes thus far, but I think it's generally correct to say that many (arguably most) prediction markets still lack the trading volume necessary to justify confidence that EMH-style arguments mean inefficiencies will be rapidly corrected. To a large extent, it's fair to say this is due to over-regulation and attempts at outright banning (the relatively recent 5th Circuit ruling in favor of PredictIt against the Commodity Futures Trading Commission is perhaps worth looking at as a microcosm of how these legal battles are playing out).
Nevertheless, the standard theoretical argument that inefficiencies in prediction markets are exploitable and thus lead to a self-correcting mechanism still seems entirely correct, as Garrett Baker points out.
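To make the volume point concrete, here is a toy back-of-the-envelope sketch; every number in it (price, probability, fee, order-book depth) is an illustrative assumption rather than real market data:

```python
# Toy sketch: expected profit from a mispriced binary prediction-market contract.
# All numbers below are illustrative assumptions, not real market data.

price = 0.40           # market price of a "YES" share paying $1 if the event happens
fair_prob = 0.50       # your (assumed) calibrated probability of the event
fee_rate = 0.05        # fee charged on winnings (assumed)
available_depth = 200  # shares actually available at this price (thin order book)

# Expected value per share: win (1 - price) net of fees with prob fair_prob,
# lose the purchase price with prob (1 - fair_prob).
edge_per_share = fair_prob * (1 - price) * (1 - fee_rate) - (1 - fair_prob) * price
total_expected_profit = edge_per_share * available_depth

print(f"Expected profit per share: ${edge_per_share:.3f}")
print(f"Expected profit at available depth: ${total_expected_profit:.2f}")
# A 10-point mispricing at this depth is only worth about $17 in expectation --
# often not enough to attract the capital that would correct it, which is
# exactly the low-volume worry above.
```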
I think mind projection especially happens with value judgements - i.e. people treat "goodness" or "badness" as properties of things out in the world.
It's worth noting, I think, that Steve Byrnes has done a great job describing and analyzing this phenomenon in Section 2.2 of his post on Valence & Normativity. I have mentioned before that I think his post is excellent, so it seems worthwhile to signal-boost it here as well.
Cognitively speaking, treating value as a property of stuff in the world can be useful for planning
Also mentioned and analyzed in Section 2.3 of Byrnes's post :)
Another option is to move to the Platinum Rule: Treat others the way they would want to be treated.
This suffers from a different, but still very serious problem, namely the fact that most people are at least somewhat selfish and want to be treated in a way that gives them (from an overall societal perspective) an unjustifiably large amount of resources, priority, respect, epistemic authority, etc. Granting them this might be good for the individuals in those spots, but has a negative effect on society as a whole.
It sure would be nice and convenient if it just so happened that "do what person X tells you to" and "do what's best for the community around you" were always in alignment. After all, in such a world, we wouldn't need to make these tough choices or ever have to reject the requests made by concrete people in front of us because of abstract principles or consequentialist reasoning (which is often quite difficult to do, especially for otherwise nice, prosocial, and empathetic people). But, as it turns out, we live in a world with scarce resources, one in which avoiding conflict by taking a third option often fails, and we must confront the issues directly instead.
For example, if somebody tells you that the way they want you to treat them is to always agree with everything they are saying and to never criticize their statements, what are you meant to do? If you give in, this not only allows potentially bad and damaging claims to go unchallenged, but also sets up terrible incentives for everyone else to copy this behavior. But if you don't give in, suddenly you aren't using the Platinum Rule anymore; perhaps you have some set of principles that tell you where to draw the line between "spots where people are allowed to impose their desires for how they should be treated on you" and "spots where such requests should be given no deference whatsoever," but now you've just moved the entire discussion up a step without major changes to the status quo; everyone will start arguing over where the line should be drawn and which side of it any given situation falls on.
There is also another issue, namely that if you empower individuals to effect great change in how others interact with them solely on the basis of their preferences, you are setting up bad incentives for how those individuals' preferences will change over time. As Vladimir_M wrote in a classic response[1] to Scott Alexander's old post:
In a world where people make decisions according to this principle, one has the incentive to self-modify into a utility monster who feels enormous suffering at any actions of other people one dislikes for whatever reason. And indeed, we can see this happening to some extent: when people take unreasonable offense and create drama to gain concessions, their feelings are usually quite sincere.
You say, "pretending to be offended for personal gain is... less common in reality than it is in people's imaginations." That is indeed true, but only because people have the ability to whip themselves into a very sincere feeling of offense given the incentive to do so. Although sincere, these feelings will usually subside if they realize that nothing's to be gained.
And then the follow-up:
If we have a dispute and I credibly signal that I'm going to flip out and create drama out of all proportion to the issue at stake, you're faced with a choice between conceding to my demands or getting into an unpleasant situation that will cost more than the matter of dispute is worth. I'm sure you can think of many examples where people successfully get the upper hand in disputes using this strategy. The only way to disincentivize such behavior is to pre-commit credibly to be defiant in face of threats of drama. In contrast, if you act like a (naive) utilitarian, you are exceptionally vulnerable to this strategy, since I don't even need drama to get what I want, if I can self-modify to care tremendously about every single thing I want. (Which I won't do if I'm a good naive utilitarian myself, but the whole point is that it's not a stable strategy.)
Now, the key point is that such behavior is usually not consciously manipulative and calculated. On the contrary -- someone flipping out and creating drama for a seemingly trivial reason is likely to be under God-honest severe distress, feeling genuine pain of offense and injustice. This is a common pattern in human social behavior: humans are extremely good at detecting faked emotions and conscious manipulation, and as a result, we have evolved so that our brains lash out with honest strong emotion that is nevertheless directed by some module that performs game-theoretic assessment of the situation. This of course prompts strategic responses from others, leading to a strategic arms race without end.
The further crucial point is that these game-theoretic calculators in our brains are usually smart enough to assess whether the flipping out strategy is likely to be successful, given what might be expected in response. Basically, it is a part of the human brain that responds to rational incentives even though it's not under the control of the conscious mind. With this in mind, you can resolve the seeming contradiction between the sincerity of the pain of offense and the fact that it responds to rational incentives.
- ^
And not to get into the weeds of political issues too much, but I think the claims in his comment have been shown to be correct, given the changes in social and political discourse (in the US, at least) over the past 10 years or so.
Ben Pace has said that perhaps he doesn't disagree with you in particular about this, but I sure think I do.[1]
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline.
I don't see how the first half of this could be correct, and while the second half could be true, it doesn't seem to me to offer meaningful support for the first half either (instead, it seems rather... off-topic).
As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind.
Moreover, in a spot where you and your company have actually done something meaningful and responsible to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than the additional constraint you face in a world in which you do not have anything concrete to back your generic claims of responsibility with, since that is a spot where you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically straining than telling the truth as they see it.
Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:
AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place.
[1] Quite the opposite, actually, if the change in the wider society's opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
In any case, I have already made all these points in a number of ways in my previous response to you (which you haven't addressed, and which still seem to me to be entirely correct).
- ^
He also said that he thinks your perspective makes sense, which... I'm not really sure about.
Definitely not trying to put words in Habryka's mouth, but I did want to make a concrete prediction to test my understanding of his position; I expect he will say that:
- the only work which is relevant is the one that tries to directly tackle what Nate Soares described as "the hard bits of the alignment challenge" (the identity of which Habryka basically agrees with Soares about)
- nobody is fully on the ball yet
- but agent foundations-like research by MIRI-aligned or formerly MIRI-aligned people (Vanessa Kosoy, Abram Demski, etc.) is the one that's most relevant, in theory
- however, in practice, even that is kinda irrelevant because timelines are short and that work is going along too slowly to be useful even for deconfusion purposes
Edit: I was wrong.
Updatelessness sure seems nice from a theoretical perspective, but it has a ton of problems that go beyond what you just mentioned and which seem to me to basically doom the entire enterprise (at least with regards to what we are currently discussing, namely people):
- I am not aware of any method of operationalizing even a weak version of updatelessness in the context of cognitively limited human beings that do not have access to their own source code
- I am pretty sure that a large portion of my values (and, by extension, the values of the vast majority of people) are indexical in nature, at least partly because my access to the outside world is mediated through sense data, which my S1 seems to value "terminally" and not as a mere proxy for preferences over current world-states. Indexicality seems to me to play very poorly with updatelessness (although I suspect you would know more about this than me, given your work in this area?)
- I don't currently know of a way that humans could remain updateless even in (what seems to me like an inordinately optimistic) world in which we can actually access the "source code" by figuring out how to model the abstract classical computation performed by a particular (and reified) subset of the brain's circuitry, basically because of the reasons I gave in my comment to Wei Dai that I referenced earlier ("The feedback loops implicit in the structure of the brain cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not "update" its brain chemistry the same way that a biological being does be "human" in any decision-relevant way?)")
- I have a much broader skepticism about whether the concepts of "beliefs" and "values" even make sense as distinct, coherent concepts that carve reality at the joints; I think this is reflected in some of the other points I made in my long list of questions and confusions about these matters. It doesn't really seem to me like updatelessness solves this, or even necessarily offers a concrete path forward on it.
Of course, I don't expect that you are trying to literally say that going updateless gets rid of all the issues, but rather that thinking about it in those terms, after internalizing that perspective, helps put us in the right frame of mind to make progress on these philosophical and metaphilosophical matters moving forward. But, as I said at the end of my comment to Wei Dai:
I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm to think through. Unfortunately, getting all this right seems very important if we want to get to a great future. Based on my reading of the general pessimism you have been signaling throughout your recent posts and comments, it doesn't seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.
Perhaps something interesting would come out if a group of really smart, philosophy-inclined people (ones who have internalized the lessons of the Sequences without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like, conclusions which seem to be contradicted by the modularity, lack of agentic activity, moderate effectiveness of RLHF, and overall empirical picture coming from recent SOTA models) were given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection. But that seems like quite a long shot at this point.
OK, but what is your “intent”? Presumably, it’s that something be done in accordance with your values-on-reflection, right?
No, I don't think so at all. Pretty much the opposite, actually; if it were in accordance with my values-on-reflection, it would be value-aligned to me rather than intent-aligned. Collapsing the meaning of the latter into the former seems entirely unwise to me. After all, when I talk about my intent, I am explicitly not thinking about any long reflection process that gets at the "core" of my beliefs or anything like that;[1] I am talking more about something like this:
I have preferences right now; this statement makes sense in the type of low-specificity conversation dominated by intuition where we talk about such words as though they referred to real concepts that point to specific areas of reality. Those preferences are probably not coherent, in the sense that I can probably be money-pumped by an intelligent enough agent that sets up a scenario strange to my current self. But they still exist, and one of them is to maintain a sufficient amount of money in my bank account to continue living a relatively high-quality life. Whether I "endorse" those preferences or not is entirely irrelevant to whether I have them right now; perhaps you could offer a rational argument that eventually convinces me you would make much better use of all my money, and then I would endorse giving you that money, but I don't care about any of that right now. My current, unreflectively-endorsed self doesn't want to part with what's in my bank account, and that's what's guiding my actions, not an idealized, reified future version of me.
None of this means anything conclusive about me ultimately endorsing these preferences in the reflective limit, about those preferences being stable under ontology shifts that reveal how my current ontology is hopelessly confused and reifies the analogues of ghosts, about there being any nonzero intersection between the end states of the processes that try to find my individual volition, or about changes to my physical and neurological make-up keeping my identity the same (in a decision-relevant sense relative to my values) when my memories and path through history change.
In any case, I am very skeptical of this whole values-on-reflection business,[2] as I have written about at length in many different spots (1, 2, 3 come to mind off the top of my head). I am loath to keep copying the exposition of the same ideas over and over and over again (it also probably gets annoying to read at some point), but here is a relevant sample:
Whenever I see discourse about the values or preferences of beings embedded in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters (I am not referring to [Wei Dai] in particular here, since [Wei Dai] has already signaled an appropriate level of confusion about this). Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution without the appropriate level of rigor and care.
What counts as human "preferences"? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories, or maybe a combination of those, or maybe something else entirely? Do we actually have any good reason to think that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to? What do we do with the fact that humans don't seem to have utility functions and yet lingering confusion about this remained as a result of many incorrect and misleading statements by influential members of the community?
How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?
In any case, are they indexical or not? If we are supposed to think about preferences in terms of revealed preferences only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic? Aren't preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory, meaning we would need some canonical framework of translating the incoherent and yet supposedly very complex and multidimensional set of human desires into something that actually corresponds to reality? What additional structure must be grafted upon the empirically-observable behaviors in order for "what the human actually wants" to be well-defined?
[...]
What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like "CEV" probably doesn't make sense? The feedback loops implicit in the structure of the brain cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not "update" its brain chemistry the same way that a biological being does be "human" in any decision-relevant way?).
I do have some other thoughts on other parts of the post, which I might write out at some point.
I don't really think any of that affects the difficulty of public communication
The basic point would be that it's hard to write publicly about how you are taking responsible steps that grapple directly with the real issues... if you are not in fact doing those responsible things in the first place. This seems locally valid to me; you may disagree on the object level about whether Adam Scholl's characterization of Anthropic's agenda/internal work is correct, but if it is, then it would certainly affect the difficulty of public communication to such an extent that it might well become the primary factor that needs to be discussed in this matter.
Indeed, the suggestion is for Anthropic employees to "talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic" and the counterargument is that doing so would be nice in an ideal world, except it's very psychologically exhausting because every public statement you make is likely to get maliciously interpreted by those who will use it to argue that your company is irresponsible. In this situation, there is a straightforward direct correlation between the difficulty of public communication and the likelihood that your statements will get you and your company in trouble.
But the more responsible you are in your actual work, the more responsible-looking details you will be able to bring up in conversations with others when you discuss said work. AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place. After all, as Paul Graham often says, "If you want to convince people of something, it's much easier if it's true."
As I see it, not being able to bring up Anthropic's work/views on this matter without some AI safety person successfully making it seem like Anthropic is behaving badly is rather strong Bayesian evidence that Anthropic is in fact behaving badly. So this entire discussion, far from being an insult, seems directly on point to the topic at hand, and locally valid to boot (although not necessarily sound, as that depends on an individualized assessment of the particular object-level claims about the usefulness of the company's safety team).
- ^
Quite the opposite, actually, if the change in the wider society's opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
Ah, oops.
I think "The first AGI probably won't perform a pivotal act" is by far the weakest section.
To start things off, I would predict a world with slow takeoff and personal intent-alignment looks far more multipolar than the standard Yudkowskian recursively self-improving singleton that takes over the entire lightcone in a matter of "weeks or hours rather than years or decades". So the title of that section seems a bit off because, in this world, what the literal first AGI does becomes much less important, since we expect to see other similarly capable AI systems get developed by other leading labs relatively soon afterwards anyway.
But, in any case, the bigger issue I have with the reasoning there is the assumption (inferred from statements like "the humans in charge of AGI may not have the chutzpah to even try such a thing") that the social response to the development of general intelligence is going to be... basically muted? Or that society will continue to be business-as-normal in any meaningful sense? I would be entirely shocked if the current state of the world in which the vast majority of people have little knowledge of the current capabilities of AI systems and are totally clueless about the AI labs' race towards AGI were to continue past the point that actual AGI is reached.
I think intuitions of the "There's No Fire Alarm for Artificial General Intelligence" type are very heavily built around the notion of a takeoff so rapid that there might well be no major economic evidence of the impact of AI before the most advanced systems become massively superintelligent. Or that there might not be massive rises in unemployment negatively impacting many people who are trying to live through the transition to an eventual post-scarcity economy. Or that the ways people relate to AIs or to one another will not be completely turned on their heads.
A future world in which we get pretty far along the way to no longer needing old OSs or programming languages because you can get an LLM to write really good code for you, in which AI can write an essay better than most (if not all) A+ undergrad students, in which it can solve Olympiad math problems better than all contestants and do research better than a graduate student, in which deep-learning based lie detection technology actually gets good and starts being used more and more, in which major presidential candidates are already using AI-generated imagery and causing controversies over whether others are using similar technology, in which the capacity to easily generate whatever garbage you request breaks the internet or fills it entirely with creepy AI-generated propaganda videos made by state-backed cults, is a world in which stability and equilibrium are broken. It is not a world in which "normality" can continue, in the sense that governments and people keep sleepwalking through the threat posed by AI.
I consider it very unlikely that such major changes to society can go by without the fundamental thinking around them changing massively, and without those who will be close to the top of the line of "most informed about the capabilities of AI" grasping the importance of the moment. Humans are social creatures who delegate most of their thinking on what issues should even be sanely considered to the social group around them; a world with slow takeoff is a world in which I expect massive changes to happen during a long enough time-span that public opinion shifts, dragging along with it both the Overton window and the baseline assumptions about what can/must be done about this.
There will, of course, be a ton of complicating factors that we can discuss, such as the development of more powerful persuasive AI catalyzing the shift of the world towards insanity and inadequacy, but overall I do not expect the argument in this section to go through.
Right, but such an argument would not be "sound" from a theoretical logical perspective (according to the definition I mentioned in my previous comment), which is the only point I meant to get across earlier.
No, Z literally cannot contain R because R>Z.
I don't see what this has to do with randomness or countability? You are the one who brought those two notions up, and that part of my response was only meant to deal with them.
You maybe confusing "sound" with "proven".
No, I am using "sound" in the standard philosophical sense as meaning an argument that is both valid and has true premises, which we do not know holds here because we do not know that the premise is correct.
Why does the inline react tool require the selection of a unique snippet of text? This is sometimes slightly counterproductive, for example when trying to react to a particular word or phrase with "Taboo those words?" (oftentimes, the reason I would like to react that way is precisely because that word was used very many times in that comment/post...)
No, I don't think that alone would do it, either.
As a basic counterexample, just consider a fully empty infinite universe. It is in equilibrium (and does not violate any known laws of physics), it has an infinite size, and it adheres to the cosmological principle (because every single region is just as empty as any other region). And yet, it quite obviously does not contain every possible configuration of atoms that the laws of physics would allow...[1]
Or consider a universe that just has copies of the same non-empty local structure, repeated in an evenly-spaced grid. From the perspective of any of the local structures, the universe looks the same in every direction. But the repeated tiling pattern confines the set of configurations that actually occur to a finite collection.
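To make the tiling counterexample a bit more precise (under the simplifying assumption that we can model the universe as a field on a lattice with finitely many possible cell contents):

```latex
% Model the universe as a configuration f : \mathbb{Z}^3 \to C, where C is a
% finite set of possible cell contents, and suppose f is periodic with period p:
\[
  f(x + p\,e_i) = f(x) \qquad \text{for } i = 1, 2, 3 \text{ and all } x \in \mathbb{Z}^3 .
\]
% The pattern visible through any fixed finite window then depends only on the
% window's position modulo p, so at most p^3 distinct patterns of that shape ever
% occur -- each repeated infinitely often, but nowhere near "every possible
% configuration".
```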
- ^
Unless we use a definition of "possible" that just collapses into tautology due to macro-scale determinism...
This is because Z:
1. Isn't random
2. Is uncountably smaller than R
This is imprecise. It is more useful to say that it happens simply because we literally have $\mathbb{Z} \subsetneq \mathbb{R}$.
Indeed, randomness and countability have little to do with this situation. Consider $S = (\mathbb{R} \setminus \mathbb{Q}) \cup B$, where $B$ is a random set of nonzero rational numbers (so that $0 \notin B$). Then $S$ is a random set (i.e., a set-valued random variable) that is not uncountably smaller than $\mathbb{R}$ (the difference between $\mathbb{R}$ and $S$ is included in the countable set $\mathbb{Q}$), and yet we know for sure that not all real numbers are in $S$ (because, for example, $0$ cannot be an element of $S$).
Saying every finite combination of atoms exists an infinite number of times in an unbounded universe is more like saying every finite sequence of digits exists an infinite number of times in the digits of pi.
Note that the last property (which would follow from pi being normal, a conjecture that has not been proven) is not something we actually know to be true. So if you are relying on the assumption that it is true in order to argue, by analogy, that "every finite combination of atoms exists an infinite number of times in an unbounded universe" must also be true, the resulting argument would not be sound.
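For reference, here is the standard definition of base-10 normality (the statement is the textbook one, though the notation is mine):

```latex
% x is normal in base 10 iff every finite digit string s of length k appears in
% the decimal expansion d_1 d_2 d_3 ... of x with limiting frequency 10^{-k}:
\[
  \forall k \ge 1,\ \forall s \in \{0,\dots,9\}^{k}:\quad
  \lim_{n \to \infty} \frac{\#\{\, i \le n : d_i d_{i+1} \cdots d_{i+k-1} = s \,\}}{n}
  \;=\; 10^{-k}.
\]
% Normality implies (and is strictly stronger than) the property that every finite
% digit string occurs at least once; neither statement has been proven for pi.
```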
In light of this particular example, I also don't really understand why you focused on randomness in your previous comment. After all, pi is not a "random" number under the most natural meaning of that term; it and its digits are fully deterministic.
the universe is infinite in the sense that every possible combination of atoms is repeated an infinite number of times (either because the negative curvature of the universe implies the universe is unbounded or because of MWI)
As a side note, I do not like seeing these two possible causes of this belief be put in the same bucket. I think they are quite different in an important respect: while it seems MWI would indeed imply this conclusion, negative curvature alone certainly would not. A universe that is unbounded in size can certainly be bounded in a lot of other respects, and there is no particularly persuasive reason to think that "every possible combination of atoms is repeated an infinite number of times."
After all, the set of all integers is unbounded and infinite, but this does not imply that every real number occurs inside of it.
the requirements for finding truth were, in decreasing order of importance, luck, courage, and technique (and this surely applies to most endeavours, not just the search for truth)
Perhaps this might be the order of importance of these factors in the quest of finding any particular truth, but in the aggregate, I would expect technique (i.e., basic principles of rationality that tell you what truth is, what it should imply, how to look for it, what can justifiably change your view about it, etc) to be the most important one in the long run. This is mostly because it is the one that scales best when the world around us changes such that there is a greater supply of information out there from which important insights can be drawn.
Ok, the information's harmful. You need humans to touch that info anyways to do responsible risk-mitigation. So now what ?
I think one of the points is that you should now focus on selective rather than corrective or structural means to figure out who is nonetheless allowed to work on the basis of this information.
Calling something an infohazard, at least in my thinking, generally implies both that:
- any attempts to devise galaxy-brained incentive structures that try to get large groups of people to nonetheless react in socially beneficial ways when they access this information are totally doomed and should be scrapped from the beginning.
- you absolutely should not give this information to anyone that you have doubts would handle it well; musings along the lines of "but maybe I can teach/convince them later on what the best way to go about this is" are generally wrong and should also be dismissed.
So what do you do if you nonetheless require that at least some people are keeping track of things? Well, as I said above, you use selective methods instead. More precisely, you carefully curate a very short list of human beings that are responsible people and likely also share your meta views on how dangerous truths ought to be handled, and you do your absolute best to make sure the group never expands beyond those you have already vetted as capable of handling the situation properly.
But you have also written yourself a couple of years ago:
if aligned AGI gets here I will just tell it to reconfigure my brain not to feel bored, instead of trying to reconfigure the entire universe in an attempt to make monkey brain compatible with it. I sorta consider that preference a lucky fact about myself, which will allow me to experience significantly more positive and exotic emotions throughout the far future, if it goes well, than the people who insist they must only feel satisfied after literally eating hamburgers or reading jokes they haven't read before.
And indeed, when talking specifically about the Fun Theory sequence itself, you said:
I think Eliezer just straight up tends not to acknowledge that people sometimes genuinely care about their internal experiences, independent of the outside world, terminally. Certainly, there are people who care about things that are not that, but Eliezer often writes as if people can't care about the qualia - that they must value video games or science instead of the pleasure derived from video games or science.
Do you no longer endorse this?