Comments
Thanks for the answer. It's nice to get data about how other people think about this subject.
the concern that the more sociopathic people wind up in positions of power is the big concern.
Agreed!
Do I understand correctly: You'd guess that
- 99% of humans have a "positive empathy-sadism balance",
- and of those, 90-99% could be trusted to control the world (via controlling ASI),
- i.e., ~89-98% of humanity could be trusted to control the world with ASI-grade power?
If so, then I'm curious -- and somewhat bewildered! -- as to how you arrived at those guesses/numbers.
I'm under the impression that narcissism and sadism have prevalences of very roughly 6% and 4%, respectively. See e.g. this post, or the studies cited therein. Additionally, probably something like 1% to 10% of people are psychopaths, depending on what criteria are used to define "psychopathy". Even assuming there's a lot of overlap, I think a reasonable guess would be that ~8% of humans have at least one of those traits. (Or 10%, if we include psychopathy.)
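As a rough sanity check on these numbers (a sketch using the stated prevalences of 6%, 4%, and ~5% for psychopathy; the actual comorbidity structure is unknown, so the overlap assumptions here are illustrative), the plausible range for "has at least one of these traits" can be bracketed like so:

```python
# Rough bounds on P(at least one "dark" trait).
# Prevalences are the ones stated above; comorbidity is assumed, not known.
narcissism, sadism, psychopathy = 0.06, 0.04, 0.05

# Lower bound: complete overlap (every sadist/psychopath is also a narcissist).
lower = max(narcissism, sadism, psychopathy)

# Upper-ish bound: independence, i.e. 1 minus the product of complements.
indep_two = 1 - (1 - narcissism) * (1 - sadism)
indep_three = 1 - (1 - narcissism) * (1 - sadism) * (1 - psychopathy)

print(f"complete overlap:        {lower:.1%}")        # 6.0%
print(f"independent (2 traits):  {indep_two:.1%}")    # 9.8%
print(f"independent (3 traits):  {indep_three:.1%}")  # 14.3%
```

The ~8% (or ~10% including psychopathy) guess in the text sits between the complete-overlap floor and the independence ceiling, i.e. it corresponds to assuming substantial but not total overlap.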
I'm guessing you disagree with those statistics? If yes, what other evidence leads you to your different (much lower) estimates?
Do you believe that someone with (sub-)clinical narcissism, if given the keys to the universe, would bring about good outcomes for all (with probability >90%)? Why/how? What about psychopaths?
Do you completely disagree with the aphorism that "power corrupts, and absolute power corrupts absolutely"?
Do you think that having good intentions (and +0 to +3 SD intelligence) is probably enough for someone to produce good outcomes, if they're given ASI-grade power?
FWIW, my guesstimates are that
- over 50% of genpop would become corrupted by ASI-grade power, or are sadistic/narcissistic/psychopathic/spiteful to begin with,
- of the remainder, >50% would fuck things up astronomically, despite their good intentions[1],
- genetic traits like psychopathy and narcissism (not sure about sadism), and acquired traits like cynicism, are much more prevalent (~5x odds?) in people who will end up in charge of AGI projects, relative to genpop. OTOH, competence at not-going-insane is likely higher among them too.
it would be so easy to benefit humanity, just by telling your slave AGI to go make it happen. A lot of people would enjoy being hailed as a benevolent hero
I note that if someone is using an AGI as a slave, and is motivated by wanting prestige status, then I do not expect that to end well for anyone else. (Someone with moderate power, e.g. a medieval king, with the drive to be hailed a benevolent hero, might indeed do great things for other people. But someone with more extreme power -- like ASI-grade power -- could just... rewire everyone's brains; or create worlds full of suffering wretches, for him to save and be hailed/adored by; or... you get the idea.)
Even relatively trivial things like social media or drugs mess lots of humans up; and things like "ability to make arbitrary modifications to your mind" or "ability to do anything you want, to anyone, with complete impunity" are even further OOD, and open up even more powerful superstimuli/reward-system hacks. Aside from tempting/corrupting humans to become selfish, I think that kind of situation has high potential to just lead to them going insane or breaking (e.g. start wireheading) in any number of ways.
And then there are other failure modes, like insufficient moral uncertainty and locking in some parochial choice of values, or a set of values that made sense in some baseline human context but which generalize to something horrible. ("Obviously we should fill the universe with Democracy/Christianity/Islam/Hedonism/whatever!", ... "Oops, turns out Yahweh is pretty horrible, actually!") ↩︎
I'd be interested to see that draft as a post!
What fraction of humans in set X would you guess have a "positive empathy-sadism balance", for
- X = all of humanity?
- X = people in control of (governmental) AGI projects?
I agree that the social environment / circumstances could have a large effect on whether someone ends up wielding power selfishly or benevolently. I wonder if there's any way anyone concerned about x/s-risks could meaningfully affect those conditions.
I'm guessing[1] I'm quite a bit more pessimistic than you about what fraction of humans would produce good outcomes if they controlled the world.
with a lot of uncertainty, due to ignorance of your models. ↩︎
I agree that "strengthening democracy" sounds nice, and also that it's too vague to be actionable. Also, what exactly would be the causal chain from "stronger democracy" (whatever that means) to "command structure in the nationalized AGI project is trustworthy and robustly aligned to the common good"?
If you have any more concrete ideas in this domain, I'd be interested to read about them!
Pushing for nationalization or not might affect when it's done, giving some modicum of control.
I notice that I have almost no concrete model of what that sentence means. A couple of salient questions[1] I'd be very curious to hear answers to:
- What concrete ways exist for affecting when (and how) nationalization is done? (How, concretely, does one "push" for/against nationalization of AGI?)
- By what concrete causal mechanism could pushing for nationalization confer a modicum of control; and control over what exactly, and to whom?
Other questions I wish I (or people advocating for any policy w.r.t. AGI) had answers to include (i.a.) "How could I/we/anyone ensure that the resulting AGI project actually benefits everyone? Who, in actual concrete practice, would end up effectively having control over the AGI? How could (e.g.) the public hold those people accountable, even as those people gain unassailable power? How do we ensure that those people are not malevolent to begin with, and also don't become corrupted by power? What kinds of oversight mechanisms could be built, and how?" ↩︎
make their models sufficiently safe
What does "safe" mean, in this post?
Do you mean something like "effectively controllable"? If yes: controlled by whom? Suppose AGI were controlled by some high-ranking people at (e.g.) the NSA; with what probability do you think that would be "safe" for most people?
Doing nationalization right
I think this post (or the models/thinking that generated it) might be missing an important consideration[1]: "Is it possible to ensure that the nationalized AGI project does not end up de facto controlled by not-good people? If yes, how?"
Relevant quote from Yudkowsky's Six Dimensions of Operational Adequacy in AGI Projects (emphasis added):
Opsec [...] Military-grade or national-security-grade security. (It's hard to see how attempts to get this could avoid being counterproductive, considering the difficulty of obtaining trustworthy command and common good commitment with respect to any entity that can deploy such force [...])
Another quote (emphasis mine):
You cannot possibly cause such a project[2] to exist with adequately trustworthy command, alignment mindset, and common-good commitment, and you should therefore not try to make it exist, first because you will simply create a still more dire competitor developing unaligned AGI, and second because if such an AGI could be aligned it would be a hell of an s-risk given the probable command structure.
or possibly a crucial consideration ↩︎
The quote is referring to "[...] a single global Manhattan Project which is somehow not answerable to non-common-good command such as Trump or Putin or the United Nations Security Council. [...]" ↩︎
A related pattern-in-reality that I've had on my todo-list to investigate is something like "cooperation-enforcing structures". Things like
- legal systems, police
- immune systems (esp. in suppressing cancer)
- social norms, reputation systems, etc.
I'd been approaching this from a perspective of "how defeating Moloch can happen in general" and "how might we steer Earth to be less Moloch-fucked"; not so much AI safety directly.
Do you think a good theory of hierarchical agency would subsume those kinds of patterns-in-reality? If yes: I wonder if their inclusion could be used as a criterion/heuristic for narrowing down the search for a good theory?
find some way to argue that "generally intelligent world-optimizing agents" and "subjects of AGI-doom arguments" are not the exact same type of system
We could maybe weaken this requirement? Perhaps it would suffice to show/argue that it's feasible[1] to build any kind of "acute-risk-period-ending AI"[2] that is not a "subject of AGI-doom arguments"?
I'd be (very) curious to see such arguments. [3]
within time constraints, before anyone else builds a "subject of AGI-doom arguments" ↩︎
or, "AIs that implement humanity's CEV" ↩︎
If I became convinced that it's feasible to build such a "pivotal AI" that is not "subject to AGI doom arguments", I think that would shift a bunch of my probability mass from "we die due to unaligned AI" to "we die-or-worse due to misaligned humans controlling ASI" and "utopia". ↩︎
I think this is an important subject and I agree with much of this post. However, I think the framing/perspective might be subtly but importantly wrong-or-confused.
To illustrate:
How much of the issue here is about the very singular nature of the One dominant project, vs centralization more generally into a small number of projects?
Seems to me that centralization of power per se is not the problem.
I think the problem is something more like
- we want to give as much power as possible to "good" processes, e.g. a process that robustly pursues humanity's CEV[1]; and we want to minimize the power held by "evil" processes
- but: a large fraction of humans are evil, or become evil once prosocial pressures are removed; and we do not know how to reliably construct "good" AIs
- and also: we (humans) are confused and in disagreement about what "good" even means
- and even if it were clear what a "good goal" is, we have no reliable way of ensuring that an AI or a human institution is robustly pursuing such a goal.
I agree that (given the above conditions) concentrating power into the hands of a few humans or AIs would in expectation be (very) bad. (OTOH, a decentralized race is also very bad.) But concentration-vs-decentralization of power is just one relevant consideration among many.
Thus: if the quoted question has an implicit assumption like "the main variable to tweak is distribution-of-power", then I think it is trying to carve the problem at unnatural joints, or making a false implicit assumption that might lead to ignoring multiple other important variables.
(And less centralization of power has serious dangers of its own. See e.g. Wei Dai's comment.)
I think a more productive frame might be something like "how do we construct incentives, oversight, distribution of power, and other mechanisms, such that Ring Projects remain robustly aligned to 'the greater good'?"
And maybe also "how do we become less confused about what 'the greater good' even is, in a way that is practically applicable to aligning Ring Projects?"
If such a thing is even possible. ↩︎
Upvoted and disagreed. [1]
One thing in particular that stands out to me: The whole framing seems useless unless Premise 1 is modified to include a condition like
[...] we can select a curriculum and reinforcement signal which [...] and which makes the model highly "useful/capable".
Otherwise, Premise 1 is trivially true: We could (e.g.) set all the model's weights to 0.0, thereby guaranteeing the non-entrainment of any ("bad") circuits.
I'm curious: what do you think would be a good (...useful?) operationalization of "useful/capable"?
Another issue: K and epsilon might need to be unrealistically small: Once the model starts modifying itself (or constructing successor models) (and possibly earlier), a single strategically-placed sign-flip in the model's outputs might cause catastrophe. [2]
I think writing one's thoughts/intuitions out like this is valuable --- for sharing frames/ideas, getting feedback, etc. Thus: thanks for writing it up. Separately, I think the presented frame/case is probably confused, and almost useless (at best). ↩︎
Although that might require the control structures (be they Shards or a utility function or w/e) of the model to be highly "localized/concentrated" in some sense. (OTOH, that seems likely to at least eventually be the case?) ↩︎
In Fig 1, is the vertical axis P(world)?
Possibly a nitpick, but:
The development and deployment of AGI, or similarly advanced systems, could constitute a transformation rivaling those of the agricultural and industrial revolutions.
seems like a very strong understatement. Maybe replace "rivaling" with e.g. "(vastly) exceeding"?
Referring to the quote-picture from the Nvidia GTC keynote talk: I searched the talk's transcript, and could not find anything like the quote.
Could someone point out time-stamps of where Huang says (or implies) anything like the quote? Or is the quote entirely made up?
That clarifies a bunch of things. Thanks!
I'm not sure I understand what the post's central claim/conclusion is. I'm curious to understand it better. To focus on the Summary:
So overall, evolution is the source of ethics,
Do you mean: Evolution is the process that produced humans, and strongly influenced humans' ethics? Or are you claiming that (humans') evolution-induced ethics are what any reasonable agent ought to adhere to? Or something else?
and sapient evolved agents inherently have a dramatically different ethical status than any well-designed created agents [...]
...according to some hypothetical evolved agents' ethical framework, under the assumption that those evolved agents managed to construct the created agents in the right ways (to not want moral patienthood etc.)? Or was the quoted sentence making some stronger claim?
evolution and evolved beings having a special role in Ethics is not just entirely justified, but inevitable
Is that sentence saying that
- evolution and evolved beings are of special importance in any theory of ethics (what ethics are, how they arise, etc.), due to Evolution being one of the primary processes that produce agents with moral/ethical preferences [1]
or is it saying something like
- evolution and evolved beings ought to have a special role; or we ought to regard the preferences of evolved beings as the True Morality?
I roughly agree with the first version; I strongly disagree with the second: I agree that {what oughts humans have} is (partially) explained by Evolutionary theory. I don't see how that crosses the is-ought gap. If you're saying that that somehow does cross the is-ought gap, could you explain why/how?
I.e., similar to how one might say "amino acids having a special role in Biochemistry is not just entirely justified, but inevitable"? ↩︎
I wonder how much work it'd take to implement a system that incrementally generates a graph of the entire conversation. (Vertices would be sub-topics, represented as e.g. a thumbnail image + a short text summary.) Would require the GPT to be able to (i.a.) understand the logical content of the discussion, and detect when a topic is revisited, etc. Could be useful for improving clarity/productivity of conversations.
One of the main questions on which I'd like to understand others' views is something like: Conditional on sentient/conscious humans[1] continuing to exist in an x-risk scenario[2], with what probability do you think they will be in an inescapable dystopia[3]?
(My own current guess is that dystopia is very likely.)
or non-human minds, other than the machines/Minds that are in control ↩︎
as defined by Bostrom, i.e. "the permanent and drastic destruction of [humanity's] potential for desirable future development" ↩︎
Versus e.g. just limited to a small disempowered population, but living in pleasant conditions? Or a large population living in unpleasant conditions, but where everyone at least has the option of suicide? ↩︎
That makes sense; but:
so far outside the realm of human reckoning that I'm not sure it's reasonable to call them dystopian.
setting aside the question of what to call such scenarios, with what probability do you think the humans[1] in those scenarios would (strongly) prefer to not exist?
or non-human minds, other than the machines/Minds that are in control ↩︎
non-extinction AI x-risk scenarios are unlikely
Many people disagreed with that. So, apparently many people believe that inescapable dystopias are not-unlikely? (If you're one of the people who disagreed with the quote, I'm curious to hear your thoughts on this.)
(Ah. Seems we were using the terms "(alignment) success/failure" differently. Thanks for noting it.)
In-retrospect-obvious key question I should've already asked: Conditional on (some representative group of) humans succeeding at aligning ASI, what fraction of the maximum possible value-from-Evolution's-perspective do you expect the future to attain? [1]
My modal guess is that the future would attain ~1% of maximum possible "Evolution-value".[2]
If tech evolution is similar enough to bio evolution then we should roughly expect tech evolution to have a similar level of success
Seems like a reasonable (albeit very preliminary/weak) outside view, sure. So, under that heuristic, I'd guess that the future will attain ~1% of max possible "human-value".
Evolution has succeeded at aligning homo sapiens brains to date
I'm guessing we agree on the following:
- Evolution shaped humans to have various context-dependent drives (call them Shards) and the ability to mentally represent and pursue complex goals. Those Shards were good proxies for IGF in the EEA[1].
- Those Shards were also good[2] enough to produce billions of humans in the modern environment. However, it is also the case that most modern humans spend at least part of their optimization power on things orthogonal to IGF.
I think our disagreement here maybe boils down to approximately the following question:
With what probability are we in each of the following worlds?
- (World A) The Shards only work[2:1] conditional on the environment being sufficiently similar to the EEA, and humans not having too much optimization power. If the environment changes too far OOD, or if humans were to gain a lot of power[3], then the Shards would cease to be good[2:2] proxies.
  In this world, we should expect the future to contain only a small fraction[4] of the "value" it would have, if humanity were fully "aligned"[2:3]. I.e. Evolution failed to "(robustly) align humanity".
- (World B) The Shards (in combination with other structures in human DNA/brains) are in fact sufficiently robust that they will keep humanity aligned[2:4] even in the face of distributional shift and humans gaining vast optimization power.
  In this world, we should expect the future to contain a large fraction of the "value" it would have, if humanity were fully "aligned"[2:5]. I.e. Evolution succeeded in "(robustly) aligning humanity".
- (World C) Something else?
I think we're probably in (A), and IIUC, you think we're most likely in (B). Do you consider this an adequate characterization?
If yes, the obvious next question would be: What tests could we run, what observations could we make,[5] that would help us discern whether we're in (A) or (B) (or (C))?
(For example: I think the kinds of observations I listed in my previous comment are moderate-to-strong evidence for (A); and the existence of some explicit-IGF-maximizing humans is weak evidence for (B).)
Environment of evolutionary adaptedness. For humans: hunter-gatherer tribes on the savanna, or maybe primitive subsistence agriculture societies. ↩︎
in the sense of optimizing for IGF, or whatever we're imagining Evolution to "care" about. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
e.g. ability to upload their minds, construct virtual worlds, etc. ↩︎
Possibly (but not necessarily) still a large quantity in absolute terms. ↩︎
Without waiting a possibly-long time to watch how things in fact play out. ↩︎
vast computation some of which is applied to ancestral simulations
I agree that a successful post-human world would probably involve a large amount[1] of resources spent on simulating (or physically instantiating) things like humans engaging in play, sex, adventure, violence, etc. IOW, engaging in the things for which Evolution installed Shards in us. However, I think that is not the same as [whatever Evolution would care about, if Evolution could care about anything]. For the post-human future to be a success from Evolution's perspective, I think it would have to be full of something more like [programs (sentient or not, DNA or digital) striving to make as many copies of themselves as possible].
(If we make the notion of "DNA" too broad/vague, then we could interpret almost any future outcome as "success for Evolution".)
a large absolute amount, but maybe not a large relative amount. ↩︎
Humans have not put an end to biological life.
Yup. I, too, have noticed that.
Your doom[1] predictions [...]
C'mon, man, that's obviously a misrepresentation of what I was saying. Or maybe my earlier comment failed badly at communication? In case that's so, here's an attempted clarification (bolded parts added):
If Evolution had a lot more time (than I expect it to have) to align humans to relative-gene-replication-count, before humans put an end to biological life, as they seem to me to be on track to do, based on things I have observed in the past, then [...] But Evolution (almost surely) does not have infinite time [...]
Point being: Sure, Evolution managed to cough up some individuals who explicitly optimize for IGF. But they're exceptions, not the rule; and humanity seems (based on past observations!) to be on track to (mostly) end DNA-based life. So it seems premature to say that Evolution succeeded at aligning humanity.
In case you're wondering what past observations lead me to think that humans are unaligned[2] w.r.t. IGF and on track to end (or transcend) biological life, here are some off the top of my head:
- Of the people whose opinions on the subject I'm aware of (including myself), nearly all would like to transcend (or end) biological life.[3]
- Birth rates in most developed nations have been low or below replacement for a long time.[4] There seems to be a negative correlation between wealth/education and number of offspring produced. That matches my impression that as people gain wealth, education, and empowerment in general, most choose to spend it mostly on something other than producing offspring.
- Diligent sperm bank donors are noteworthy exceptions. Most people are not picking obvious low-hanging fruit to increase their IGF. Rich people waste money on yachts and stuff, instead of using it to churn out as many high-fitness offspring as possible; etc.
- AFAIK, most of the many humans racing to build ASI are not doing so with the goal of increasing their IGF. And absent successful attempts to align ASI specifically to producing lots of DNA-based replicators, I don't see strong reason to expect the future to be optimized for quantity of DNA-based replicators.
Perhaps you disagree with the last point above?
There are many other scenarios where DNA flourishes even after a posthuman transition.
Interesting. Could you list a few of those scenarios?
Note: I wasn't even talking (only) about doom; I was talking about humanity seemingly being on track to end biological life. I think the "good" outcomes probably also involve transcending biology/DNA-based replicators. ↩︎
to the extent that it even makes sense to talk about incoherent things like humans being "(mis/un)aligned" to anything. ↩︎
My sample might not be super representative of humanity as a whole. Maybe somewhat representative of people involved in AI, though? ↩︎
At least according to sources like this: https://en.wikipedia.org/wiki/Total_fertility_rate ↩︎
evolution did in fact find some weird way to create humans who rather obviously consciously optimize for IGF! [...]
If Evolution had a lot more time to align humans to relative-gene-replication-count, before humans put an end to biological life, then sure, seems plausible that Evolution might be able to align humans very robustly. But Evolution does not have infinite time or "retries" --- humanity is in the process of executing something like a "sharp left turn", and seems likely to succeed long before the human gene pool is taken over by sperm bank donors and such.
The utility function is fitness: gene replication count (of the human defining genes) [1]
Seems like humans are soon going to put an end to DNA-based organisms, or at best relegate them to some small fraction of all "life". I.e., seems to me that the future is going to score very poorly on the gene-replication-count utility function, relative to what it would score if humanity (or individual humans) were actually aligned to gene-replication-count.
Do you disagree? (Do you expect the post-ASI future to be tiled with human DNA?)
Obviously Evolution doesn't actually have a utility function, and if it did, gene-replication-count is probably not it, as TekhneMakre points out. But, let's accept that for now, arguendo. ↩︎
I mostly agree.
I also think that impact is very unevenly distributed over people; the most impactful 5% of people probably account for >70% of the impact. [1]
And if so, then the difference in positive impact between {informing the top 5%} and {broadcasting to the field in general on the open Internet} is probably not very large. [2]
Possibly also worth considering: Would (e.g.) writing a public post actually reach those few key people more effectively than (e.g.) sending a handful of direct/targeted emails? [3]
Talking about AI (alignment) here, but I think something like this applies in many fields. I don't have a good quantification of "impact" in mind, though, so this is very hand-wavey. ↩︎
Each approach has its downsides. The first approach requires identifying the relevant people, and is likely more effortful. The latter approach has the downside of putting potentially world-ending information in the hands of people who would use it to end the world (a bit sooner than they otherwise would). ↩︎
What is in fact the most effective way to reach whoever needs to be reached? (I don't know.) ↩︎
If {the reasoning for why AGI might not be near} comprises {a list of missing capabilities}, then my current guess is that the least-bad option would be to share that reasoning in private with a small number of relevant (and sufficiently trustworthy) people[1].
(More generally, my priors strongly suggest keeping any pointers to AGI-enabling capabilities private.)
E.g. the most capable alignment researchers who seem (to you) to be making bad strategic decisions due to not having considered {the reasoning for why AGI might not be near}. ↩︎
I can't critique your plan, because I can't parse your writing. My suggestion would be to put some effort into improving the clarity of your writing. [1]
Even basic things, such as the avoidance of long sentences, sometimes with side notes included and separated from the main sentence by commas, rather than e.g. em dashes, and making the scopes of various syntactic structures unambiguous, could go a long way towards making your text more legible. ↩︎
[...] bridge the "gap" between (less-precise proofs backed by advanced intuition) and (precise proofs simple enough for basically anyone to technically "follow").
Meta: Please consider using curly or square brackets ({} or []) for conceptual/grammatic grouping; please avoid overloading parentheses.
Thumbs up for trying to think of novel approaches to solving the alignment problem.
Every time the model does something that harms the utility function of the dumber models, it gets a loss function.
A few confusions:
- By "it gets a loss function", did you mean "it gets negative reward"?
- If yes, doesn't this plan consist entirely of reinforcement learning? How does this "emulate Evolution"?
- What exactly does the quoted sentence mean? Does the smarter model (S) receive RL signals proportional to... changes in the dumber agents' (D's) total utility?
Some problems, off the top of my head:
- GPT-like models don't have utility functions.
- Even if they did, mechinterp is nowhere near advanced enough to be able to reveal models' utility functions.
- Humans don't have utility functions. It's unclear how this would generalize to human-alignment.
- It's very much unclear what policy S would end up learning in this RL setup. It's even less clear how that policy would generalize outside of training.
- If S is given reward proportional to (changes in) D's utility, then basically we're just training S with D's utility function. I.e., just training some arbitrary RL policy/agent. Not much to do with alignment, AFAICT. [1]
- If S is instead given reward for things like {taking actions that lead to obtaining information about D's utility function}, then... we're training an RL policy/agent on proxies to "alignment". I expect that kind of approach to break down badly (due to Goodhart) when S becomes highly capable.
I don't know how you arrived at this plan, but I'm guessing it involved reasoning with highly abstract and vague concepts. You might be interested in (i.a.) these tools/techniques:
- https://www.lesswrong.com/posts/GKfPL6LQFgB49FEnv/replace-the-symbol-with-the-substance
- https://www.lesswrong.com/posts/JcpzFpPBSmzuksmWM/the-5-second-level
Except maybe if you somehow managed to have the entire simulation be a very accurate model of the real world, and D's be very accurate models of humans. But that's not remotely realistic; and still subject to Goodhart. ↩︎
Tamsin Leake's project might match what you're looking for.
[...] iteratively align superintelligence.
To align the first automated alignment researcher, [...]
To validate the alignment of our systems, [...]
What do they mean by "aligned"?
How do we ensure AI systems much smarter than humans follow human intent?
OK. Assuming that
- sharp left turns are not an issue,
- and scalable oversight is even possible in practice,
- and OAI somehow solves the problems of
  - AIs hacking humans (to influence their intents),
  - and deceptive alignment,
  - humans going crazy when given great power,
  - etc.
- and all the problems no-one has noticed yet,
then, there's the question of "aligned to what"? Whose intent? What would success at this agenda look like?
Maybe: A superintelligence that accurately models its human operator, follows the human's intent[1] to complete difficult-but-bounded tasks, and is runnable at human-speed with manageable amount of compute, sitting on OAI's servers?
Who would get to use that superintelligence? For what purpose would they use it? How long before the {NSA, FSB, CCP, ...} steal that superintelligence off OAI's servers? What would they use it for?
Point being: If an organization is not adequate in all key dimensions of operational adequacy, then even if they somehow miraculously solve the alignment/control problem, they might be increasing S-risks while only somewhat decreasing X-risks.
What is OAI's plan for getting their opsec and common-good-commitment to adequate levels? What's their plan for handling success at alignment/control?
and does not try to hack the human into having more convenient intents ↩︎
To what extent would you expect the government's or general populace's responses to "Robots with guns" to be helpful (or harmful) for mitigating risks from superintelligence? (Would getting them worried about robots actually help with x-risks?)
Right; that would be a silly thing to think.
My intended message might've been better worded as follows:
If staring into abysses is difficult/rough, then adequately staring into the darker abysses might require counter-intuitively large amounts of effort/agency. And yet, I think it might be necessary to grok those darker abysses, if we are to avoid falling into them. That makes me worried.
OTOH, you seem exceptionally reflective, so perhaps that worry is completely unfounded in your case. Anyway, I'm grateful for the work you do; I wish there were more people like you in this world. (Also, your attention is extremely valuable, so please feel free to promptly drop/forget this conversation.)
When people call things like this post "rough to write/read", and consider them to require a content warning, I wonder if most people are able to think clearly (or at all) about actually terrible scenarios, and worry that they aren't. (I'm especially worried if those people have influence in a domain where there might be a tradeoff between mitigating X-risks vs mitigating S-risks.)
I liked the description of the good future, though. Thanks for the reminder that things can (maybe) go well, too.
Thanks for the response.
To the extent that I understand your models here, I suspect they don't meaningfully bind/correspond to reality. (Of course, I don't understand your models at all well, and I don't have the energy to process the whole post, so this doesn't really provide you with much evidence; sorry.)
I wonder how one could test whether or not the models bind to reality? E.g. maybe there are case examples (of agents/people behaving in instrumentally rational ways) one could look at, and see if the models postdict the actual outcomes in those examples?
Other agents are not random though.
I agree. But AFAICT that doesn't really change the conclusion that fewer agents would tend to make the world more predictable/controllable. As you say yourself:
I don't think other agents are different from any other physical system that might be more-or-less chaotic, unpredictable or difficult to control.
And that was the weaker of the two apparent problems. What about the {implied self-preservation and resource acquisition} part?
Firstly, a bias towards choices which leave less up to chance.
Wouldn't this imply a bias towards eliminating other agents? (Since that would make the world more predictable, and thereby leave less up to chance?)
And thirdly, a bias towards choices which afford more choices later on.
Wouldn't this strongly imply biases towards both self-preservation and resource acquisition?
If the above two implications hold, then the conclusion
that the biases induced by instrumental rationality at best weakly support [...] that machine superintelligence is likely to lead to existential catastrophe
seems incorrect, no?
Could you briefly explain what is wrong with the reasoning above, or point me to the parts of the post that do so? (I only read the Abstract.)
[...] bad futures without extinction, e.g. that AI systems take over but don’t kill everyone.
What probability would you assign to humans remaining but not being able to kill themselves; i.e., to inescapable dystopias (vs. dystopias whose badness for any individual is bounded by death-by-suicide)?
This post raised some interesting points, and stimulated a bunch of interesting discussion in the comments. I updated a little bit away from foom-like scenarios and towards slow-takeoff scenarios. Thanks. For that, I'd like to upvote this post.
On the other hand: I think direct/non-polite/uncompromising argumentation against other arguments, models, or beliefs is (usually) fine and good. And I think it's especially important to counter-argue possible inaccuracies in key models that lots of people have about AI/ML/alignment. However, in many places, the post reads like a personal attack on a person (Yudkowsky), rather than just on models/beliefs he has promulgated.
I think that style of discourse runs a risk of
- politicizing the topic under discussion, and thereby making it harder for people to think clearly about it
- creating a shitty culture where people are liable to get personally attacked for participating in that discourse
For that, I'd like to downvote this post. (I ended up neither up- nor down-voting.)
I will bet any amount of money that GPT-5 will not kill us all.
What's the exchange rate for USD to afterlife-USD, though? Or what if they don't use currency in the afterlife at all? Then how would you pay the other party back if you lose?
Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?
If we're literally considering a universal quantifier over the set of all possible ML models, then I'd think it extremely likely that there does exist a simpler model with perplexity no worse than GPT-4. I'm confused as to how you (seem to have) arrived at the opposite conclusion.
Imagine an astronomer in the year 1600, who frequently refers to the "giant inscrutable movements" of the stars. [...]
I think the analogy to {building intelligent systems} is unclear/weak. There seem to be many disanalogies:
In the astronomy case, we have
- A phenomenon (the stars and their motions) that we cannot affect.
- That phenomenon is describable with simple laws.
- The "piles of numbers" are detailed measurements of that phenomenon.
- It is useful to take more measurements; doing so is helpful for finding the simple laws.
In the {building AIs} case, we have
- A phenomenon (intelligence) which we can choose to implement in different ways, and which we want to harness for some purpose.
- That phenomenon is probably not entirely implementable with simple programs. (As you point out yourself.)
- The "piles of numbers" are the means by which some people are implementing the thing, as opposed to measurements of the one and only way the thing is implemented.
- And so: implementing AIs as piles-of-numbers is not clearly helpful (and might be harmful) to finding better/simpler alternatives.
So, I predict with high confidence that any ML model that can reach the perplexity levels of Transformers will also present great initial interpretive difficulty.
I do agree that any realistic ML model that achieves GPT-4-level perplexity would probably have to have at least some parts that are hard to interpret. However, I believe it should (in principle) be possible to build ML systems that have highly interpretable policies (or analogues thereof), despite having hard-to-interpret models.
I think if our goal was to build understandable/controllable/safe AI, then it would make sense to factor the AI's mind into various "parts", such as e.g. a policy, a set of models, and a (set of sub-)goals.
In contrast, implementing AIs as giant Transformers precludes making architectural distinctions between any such "parts"; the whole AI is in a(n architectural) sense one big uniform soup. Giant Transformers don't even have the level of modularity of biological brains designed by evolution.
Consequently, I still think the "giant inscrutable tensors"-approach to building AIs is terrible from a safety perspective, not only in an absolute sense, but also in a relative sense (relative to saner approaches that I can see).
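To make the contrast concrete, here's a toy sketch of what a factored architecture might look like; all names are hypothetical and purely illustrative, not a proposal for an actual design:

```python
from dataclasses import dataclass, field
from typing import Any

# Illustrative only: a mind factored into explicit, separately inspectable
# parts (models, goals, policy), in contrast to one undifferentiated
# parameter soup where no such architectural boundaries exist.

@dataclass
class FactoredAgent:
    # World-models: inspectable beliefs, keyed by domain.
    models: dict[str, Any] = field(default_factory=dict)
    # Goals: an explicit, auditable list, rather than something
    # implicit in billions of weights.
    goals: list[str] = field(default_factory=list)

    def policy(self, observation: str) -> str:
        """A transparent policy: act only toward explicitly listed goals."""
        for goal in self.goals:
            if goal in observation:
                return f"pursue:{goal}"
        return "no-op"

agent = FactoredAgent(goals=["fetch_coffee"])
print(agent.policy("user wants fetch_coffee"))  # pursue:fetch_coffee
print(agent.policy("unrelated input"))          # no-op
```

The point is not that this toy is intelligent, but that each "part" has a definite location one could audit or constrain, which a monolithic Transformer does not offer.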
scale up the experiment of Pretraining from Human Feedback by using larger data
AFAICT, PHF doesn't solve any of the core problems of alignment. IIUC, PHF is still using an imperfect reward model trained on a finite amount of human signals-of-approval; I'd tentatively expect scaling up PHF (to ASI) to result in death-or-worse by Goodhart. Haven't thought about PHF very thoroughly though, so I'm uncertain here.
we can even try to design a data set such that it uses words like freedom, justice, alignment and more value laden words
Did you mean something like "(somehow) design a data set such that, in order to predict token-sequences in that data set, the AI has to learn the real-world structure of things we care about, like freedom, justice, alignment, etc."? [1]
can only learn legitimate generalizations, not deceptive generalizations leading to deceptive alignment
I don't understand this. What difference are you pointing at with "deceptive" vs "legitimate" generalizations? How does {AI-human (and/or AI-env) interactions being limited to a simple interface} preclude {learning "deceptive" generalizations}?
I'm under the impression that entirely "legitimate" generalizations can (and a priori probably will) lead to "deception"; see e.g. https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness. Do you disagree with that? (If yes, how?)
can't amplify Goodhart
Side note: I don't understand what you mean by this (in the given context).
can't [...] hack the human's values
I don't see how this follows. IIUC, the proposition here is something like
- If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can't hack the humans.
Is that a reasonable representation of what you're saying? If yes, consider: What if we replace "the AI" with "Anonymous" and "the humans" with "the web server"? Then we get
- If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can't hack the web server
...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn't rely on hardware effects like rowhammering.
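As a toy illustration (a hypothetical server, not any real codebase): here the entire interface is a single well-defined command, "GET <filename>", and the system is still exploitable, because the vulnerability lives in how the system *processes* inputs, not in the interface's complexity:

```python
# Hypothetical file server with a deliberately simple, well-defined
# interface: requests are strings of the form "GET <filename>", intended
# to serve files only from a virtual "public/" directory.

def handle_request(request: str, files: dict) -> str:
    if not request.startswith("GET "):
        return "400 Bad Request"
    path = "public/" + request[4:]
    # Naive path normalization: '..' segments can pop 'public' itself,
    # letting a request escape the intended directory.
    parts = []
    for seg in path.split("/"):
        if seg == "..":
            if parts:
                parts.pop()
        elif seg:
            parts.append(seg)
    return files.get("/".join(parts), "404 Not Found")

files = {
    "public/index.html": "<h1>hello</h1>",
    "secret/key.txt": "TOP SECRET",
}

print(handle_request("GET index.html", files))         # intended use
print(handle_request("GET ../secret/key.txt", files))  # exploit, same simple interface
```

The second request is syntactically just as valid as the first; simplicity of the interface did nothing to prevent the traversal.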
(I agree that limiting human-AI interactions to a simple interface would be helpful, but I think it's far from sufficient (to guarantee any form of safety).)
IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier. I'm still confused as to why you think that. To the extent that I understood the reasons you presented, I think they're incorrect (as outlined above). (Maybe I'm misunderstanding something.)
I'm kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!
I think a naively designed data set containing lots of {words that are value-laden for English-speaking humans} would not cut it, for hopefully obvious reasons. ↩︎
I don't quite understand what you're saying; I get the impression we're using different ontologies/vocabularies. I'm curious to understand your model of alignment, and below are a bunch of questions. I'm uncertain whether it's worth the time to bridge the conceptual gap, though --- feel free to drop this conversation if it feels opportunity-costly.
(1.)
Are you saying that if we assumed agents to be Cartesian[1], then you'd know a solution to the problem of {how could a weak agent train and align a very powerful agent}? If yes, could you outline that solution?
(2.)
Resolve the embedded alignment problems [...] This is essentially Reinforcement Learning from Human Feedback's method for alignment
How does RLHF solve problems of embedded alignment? I'm guessing you're referring to something other than the problems outlined in Embedded Agency?
(3.)
What exact distinction do you mean by "online" vs "offline"? Given that any useful/"pivotal" AI would need to learn new things about the world (and thus, modify its own models/mind) in order to form and execute a useful/pivotal plan, it would have to learn "online", no?
(4.)
the data set used for human values
What kind of data set did you have in mind here? A data set s.t. training an AI on it in some appropriate way would lead to the AI being aligned to human values? Could you give a concrete example of such a data set (and training scheme)?
e.g. software agents in some virtual world, programmed such that agents are implemented as some well-defined datatype/class, and agent-world interactions can only happen via a well-defined simple interface, running on a computer that cannot be hacked from within the simulation. ↩︎
the more we crank up generalization ability, the better it's alignment
To me that seems almost correct, in a way that is dangerous. I'd agree with the statement that
the more we crank up generalization ability, the better the AI's ability to align to any given set of values/goals
But for that to lead to the AI being aligned with "good" values, we also need to somehow get the AI to choose/want to align with "good" values. (Whatever "good" even means; maybe humanity's CEV?) And that does not happen on its own, AFAICT.
I think it might be relevant to note here that it's not really humans who are building current SOTA AIs --- rather, it's some optimizer like SGD that's doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky's claim that it's hard/inefficient to build something without understanding it.
If Eliezer wasn't meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then [...]
I think it's important to distinguish between
- Scaling up a neural network, and running some kind of fixed algorithm on it.
- Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms.
IIUC, in Artificial Mysterious Intelligence, Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don't know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so... *shrugs*.
- A task X reduces to task Y if and only if...
Shouldn't X and Y be the other way around there?
I feel like this conversation might be interesting to continue, if I had more bandwidth, but I don't. In any case, thanks for the linked article, looks interesting based on the abstract.