find some way to argue that "generally intelligent world-optimizing agents" and "subjects of AGI-doom arguments" are not the exact same type of system
We could maybe weaken this requirement? Perhaps it would suffice to show/argue that it's feasible[1] to build any kind of "acute risk period -ending AI"[2] that is not a "subject of AGI-doom arguments"?
I'd be (very) curious to see such arguments. [3]
within time constraints, before anyone else builds a "subject of AGI-doom arguments" ↩︎
or, "AIs that implement humanity's CEV" ↩︎
If I became convinced that it's feasible to build such a "pivotal AI" that is not "subject to AGI doom arguments", I think that would shift a bunch of my probability mass from "we die due to unaligned AI" to "we die-or-worse due to misaligned humans controlling ASI" and "utopia". ↩︎
I think this is an important subject and I agree with much of this post. However, I think the framing/perspective might be subtly but importantly wrong-or-confused.
To illustrate:
How much of the issue here is about the very singular nature of the One dominant project, vs centralization more generally into a small number of projects?
Seems to me that centralization of power per se is not the problem.
I think the problem is something more like
- we want to give as much power as possible to "good" processes, e.g. a process that robustly pursues humanity's CEV[1]; and we want to minimize the power held by "evil" processes
- but: a large fraction of humans are evil, or become evil once prosocial pressures are removed; and we do not know how to reliably construct "good" AIs
- and also: we (humans) are confused and in disagreement about what "good" even means
- and even if it were clear what a "good goal" is, we have no reliable way of ensuring that an AI or a human institution is robustly pursuing such a goal.
I agree that (given the above conditions) concentrating power into the hands of a few humans or AIs would on expectation be (very) bad. (OTOH, a decentralized race is also very bad.) But concentration-vs-decentralization of power is just one relevant consideration among many.
Thus: if the quoted question has an implicit assumption like "the main variable to tweak is distribution-of-power", then I think it is trying to carve the problem at unnatural joints, or making a false implicit assumption that might lead to ignoring multiple other important variables.
(And less centralization of power has serious dangers of its own. See e.g. Wei Dai's comment.)
I think a more productive frame might be something like "how do we construct incentives, oversight, distribution of power, and other mechanisms, such that Ring Projects remain robustly aligned to 'the greater good'?"
And maybe also "how do we become less confused about what 'the greater good' even is, in a way that is practically applicable to aligning Ring Projects?"
If such a thing is even possible. ↩︎
Upvoted and disagreed. [1]
One thing in particular that stands out to me: The whole framing seems useless unless Premise 1 is modified to include a condition like
[...] we can select a curriculum and reinforcement signal which [...] and which makes the model highly "useful/capable".
Otherwise, Premise 1 is trivially true: We could (e.g.) set all the model's weights to 0.0; thereby guaranteeing the non-entrainment of any ("bad") circuits.
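To make the degenerate case concrete, here is a minimal sketch, assuming a toy fully connected network in plain Python (the `forward` and `zero_model` helpers and the layer sizes are hypothetical illustrations, not anything from the post). With every weight and bias at 0.0, the output is identical for every input, so no circuit can have been entrained, and the model is also useless.

```python
def forward(weights, biases, x):
    """Toy fully connected network with ReLU between layers (hypothetical helper)."""
    activation = list(x)
    for layer_index, (w, b) in enumerate(zip(weights, biases)):
        pre = [sum(wi * ai for wi, ai in zip(row, activation)) + bi
               for row, bi in zip(w, b)]
        is_last = layer_index == len(weights) - 1
        # No nonlinearity on the output layer.
        activation = pre if is_last else [max(0.0, v) for v in pre]
    return activation


def zero_model(sizes):
    """Build weights and biases that are all 0.0 for the given layer sizes."""
    weights = [[[0.0] * n_in for _ in range(n_out)]
               for n_in, n_out in zip(sizes, sizes[1:])]
    biases = [[0.0] * n_out for n_out in sizes[1:]]
    return weights, biases


weights, biases = zero_model([3, 4, 2])
inputs = [(1.0, 2.0, 3.0), (-5.0, 0.0, 9.0), (0.1, 0.2, 0.3)]
# Collect the distinct output vectors across very different inputs.
outputs = {tuple(forward(weights, biases, x)) for x in inputs}
print(len(outputs))  # a single distinct output: the model ignores its input
```

This is why some "useful/capable" condition seems needed for Premise 1 to say anything.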
I'm curious: what do you think would be a good (...useful?) operationalization of "useful/capable"?
Another issue: K and epsilon might need to be unrealistically small: Once the model starts modifying itself (or constructing successor models) (and possibly earlier), a single strategically-placed sign-flip in the model's outputs might cause catastrophe. [2]
I think writing one's thoughts/intuitions out like this is valuable --- for sharing frames/ideas, getting feedback, etc. Thus: thanks for writing it up. Separately, I think the presented frame/case is probably confused, and almost useless (at best). ↩︎
Although that might require the control structures (be they Shards or a utility function or w/e) of the model to be highly "localized/concentrated" in some sense. (OTOH, that seems likely to at least eventually be the case?) ↩︎
In Fig 1, is the vertical axis P(world) ?
Possibly a nitpick, but:
The development and deployment of AGI, or similarly advanced systems, could constitute a transformation rivaling those of the agricultural and industrial revolutions.
seems like a very strong understatement. Maybe replace "rivaling" with e.g. "(vastly) exceeding"?
Referring to the quote-picture from the Nvidia GTC keynote talk: I searched the talk's transcript, and could not find anything like the quote.
Could someone point out time-stamps of where Huang says (or implies) anything like the quote? Or is the quote entirely made up?
That clarifies a bunch of things. Thanks!
I'm not sure I understand what the post's central claim/conclusion is. I'm curious to understand it better. To focus on the Summary:
So overall, evolution is the source of ethics,
Do you mean: Evolution is the process that produced humans, and strongly influenced humans' ethics? Or are you claiming that (humans') evolution-induced ethics are what any reasonable agent ought to adhere to? Or something else?
and sapient evolved agents inherently have a dramatically different ethical status than any well-designed created agents [...]
...according to some hypothetical evolved agents' ethical framework, under the assumption that those evolved agents managed to construct the created agents in the right ways (to not want moral patienthood etc.)? Or was the quoted sentence making some stronger claim?
evolution and evolved beings having a special role in Ethics is not just entirely justified, but inevitable
Is that sentence saying that
- evolution and evolved beings are of special importance in any theory of ethics (what ethics are, how they arise, etc.), due to Evolution being one of the primary processes that produce agents with moral/ethical preferences [1]
or is it saying something like
- evolution and evolved beings ought to have a special role; or we ought to regard the preferences of evolved beings as the True Morality?
I roughly agree with the first version; I strongly disagree with the second: I agree that {what oughts humans have} is (partially) explained by Evolutionary theory. I don't see how that crosses the is-ought gap. If you're saying that that somehow does cross the is-ought gap, could you explain why/how?
I.e., similar to how one might say "amino acids having a special role in Biochemistry is not just entirely justified, but inevitable"? ↩︎
I wonder how much work it'd take to implement a system that incrementally generates a graph of the entire conversation. (Vertices would be sub-topics, represented as e.g. a thumbnail image + a short text summary.) Would require the GPT to be able to (i.a.) understand the logical content of the discussion, and detect when a topic is revisited, etc. Could be useful for improving clarity/productivity of conversations.
One of the main questions on which I'd like to understand others' views is something like: Conditional on sentient/conscious humans[1] continuing to exist in an x-risk scenario[2], with what probability do you think they will be in an inescapable dystopia[3]?
(My own current guess is that dystopia is very likely.)
or non-human minds, other than the machines/Minds that are in control ↩︎
as defined by Bostrom, i.e. "the permanent and drastic destruction of [humanity's] potential for desirable future development" ↩︎
Versus e.g. just limited to a small disempowered population, but living in pleasant conditions? Or a large population living in unpleasant conditions, but where everyone at least has the option of suicide? ↩︎
That makes sense; but:
so far outside the realm of human reckoning that I'm not sure it's reasonable to call them dystopian.
setting aside the question of what to call such scenarios, with what probability do you think the humans[1] in those scenarios would (strongly) prefer to not exist?
or non-human minds, other than the machines/Minds that are in control ↩︎
non-extinction AI x-risk scenarios are unlikely
Many people disagreed with that. So, apparently many people believe that inescapable dystopias are not-unlikely? (If you're one of the people who disagreed with the quote, I'm curious to hear your thoughts on this.)
(Ah. Seems we were using the terms "(alignment) success/failure" differently. Thanks for noting it.)
In-retrospect-obvious key question I should've already asked: Conditional on (some representative group of) humans succeeding at aligning ASI, what fraction of the maximum possible value-from-Evolution's-perspective do you expect the future to attain? [1]
My modal guess is that the future would attain ~1% of maximum possible "Evolution-value".[2]
If tech evolution is similar enough to bio evolution then we should roughly expect tech evolution to have a similar level of success
Seems like a reasonable (albeit very preliminary/weak) outside view, sure. So, under that heuristic, I'd guess that the future will attain ~1% of max possible "human-value".
Evolution has succeeded at aligning homo sapiens brains to date
I'm guessing we agree on the following:
- Evolution shaped humans to have various context-dependent drives (call them Shards) and the ability to mentally represent and pursue complex goals. Those Shards were good proxies for IGF in the EEA[1].
- Those Shards were also good[2] enough to produce billions of humans in the modern environment. However, it is also the case that most modern humans spend at least part of their optimization power on things orthogonal to IGF.
I think our disagreement here maybe boils down to approximately the following question:
With what probability are we in each of the following worlds?
- (World A) The Shards only work[2:1] conditional on the environment being sufficiently similar to the EEA, and humans not having too much optimization power. If the environment changes too far OOD, or if humans were to gain a lot of power[3], then the Shards would cease to be good[2:2] proxies.
  In this world, we should expect the future to contain only a small fraction[4] of the "value" it would have, if humanity were fully "aligned"[2:3]. I.e. Evolution failed to "(robustly) align humanity".
- (World B) The Shards (in combination with other structures in human DNA/brains) are in fact sufficiently robust that they will keep humanity aligned[2:4] even in the face of distributional shift and humans gaining vast optimization power.
  In this world, we should expect the future to contain a large fraction of the "value" it would have, if humanity were fully "aligned"[2:5]. I.e. Evolution succeeded in "(robustly) aligning humanity".
- (World C) Something else?
I think we're probably in (A), and IIUC, you think we're most likely in (B). Do you consider this an adequate characterization?
If yes, the obvious next question would be: What tests could we run, what observations could we make,[5] that would help us discern whether we're in (A) or (B) (or (C))?
(For example: I think the kinds of observations I listed in my previous comment are moderate-to-strong evidence for (A); and the existence of some explicit-IGF-maximizing humans is weak evidence for (B).)
Environment of evolutionary adaptedness. For humans: hunter-gatherer tribes on the savanna, or maybe primitive subsistence agriculture societies. ↩︎
in the sense of optimizing for IGF, or whatever we're imagining Evolution to "care" about. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
e.g. ability to upload their minds, construct virtual worlds, etc. ↩︎
Possibly (but not necessarily) still a large quantity in absolute terms. ↩︎
Without waiting a possibly-long time to watch how things in fact play out. ↩︎
vast computation some of which is applied to ancestral simulations
I agree that a successful post-human world would probably involve a large amount[1] of resources spent on simulating (or physically instantiating) things like humans engaging in play, sex, adventure, violence, etc. IOW, engaging in the things for which Evolution installed Shards in us. However, I think that is not the same as [whatever Evolution would care about, if Evolution could care about anything]. For the post-human future to be a success from Evolution's perspective, I think it would have to be full of something more like [programs (sentient or not, DNA or digital) striving to make as many copies of themselves as possible].
(If we make the notion of "DNA" too broad/vague, then we could interpret almost any future outcome as "success for Evolution".)
a large absolute amount, but maybe not a large relative amount. ↩︎
Humans have not put an end to biological life.
Yup. I, too, have noticed that.
Your doom[1] predictions [...]
C'mon, man, that's obviously a misrepresentation of what I was saying. Or maybe my earlier comment failed badly at communication? In case that's so, here's an attempted clarification (bolded parts added):
If Evolution had a lot more time (than I expect it to have) to align humans to relative-gene-replication-count, before humans put an end to biological life, as they seem to me to be on track to do, based on things I have observed in the past, then [...] But Evolution (almost surely) does not have infinite time [...]
Point being: Sure, Evolution managed to cough up some individuals who explicitly optimize for IGF. But they're exceptions, not the rule; and humanity seems (based on past observations!) to be on track to (mostly) end DNA-based life. So it seems premature to say that Evolution succeeded at aligning humanity.
In case you're wondering what past observations lead me to think that humans are unaligned[2] w.r.t. IGF and on track to end (or transcend) biological life, here are some off the top of my head:
- Of the people whose opinions on the subject I'm aware of (including myself), nearly all would like to transcend (or end) biological life.[3]
- Birth rates in most developed nations have been low or below replacement for a long time.[4] There seems to be a negative correlation between wealth/education and number of offspring produced. That matches my impression that as people gain wealth, education, and empowerment in general, most choose to spend it mostly on something other than producing offspring.
- Diligent sperm bank donors are noteworthy exceptions. Most people are not picking obvious low-hanging fruit for increasing their IGF. Rich people waste money on yachts and stuff, instead of using it to churn out as many high-fitness offspring as possible; etc.
- AFAIK, most of the many humans racing to build ASI are not doing so with the goal of increasing their IGF. And absent successful attempts to align ASI specifically to producing lots of DNA-based replicators, I don't see strong reason to expect the future to be optimized for quantity of DNA-based replicators.
Perhaps you disagree with the last point above?
There are many other scenarios where DNA flourishes even after a posthuman transition.
Interesting. Could you list a few of those scenarios?
Note: I wasn't even talking (only) about doom; I was talking about humanity seemingly being on track to end biological life. I think the "good" outcomes probably also involve transcending biology/DNA-based replicators. ↩︎
to the extent that it even makes sense to talk about incoherent things like humans being "(mis/un)aligned" to anything. ↩︎
My sample might not be super representative of humanity as a whole. Maybe somewhat representative of people involved in AI, though? ↩︎
At least according to sources like this: https://en.wikipedia.org/wiki/Total_fertility_rate ↩︎
evolution did in fact find some weird way to create humans who rather obviously consciously optimize for IGF! [...]
If Evolution had a lot more time to align humans to relative-gene-replication-count, before humans put an end to biological life, then sure, seems plausible that Evolution might be able to align humans very robustly. But Evolution does not have infinite time or "retries" --- humanity is in the process of executing something like a "sharp left turn", and seems likely to succeed long before the human gene pool is taken over by sperm bank donors and such.
The utility function is fitness: gene replication count (of the human defining genes) [1]
Seems like humans are soon going to put an end to DNA-based organisms, or at best relegate them to some small fraction of all "life". I.e., seems to me that the future is going to score very poorly on the gene-replication-count utility function, relative to what it would score if humanity (or individual humans) were actually aligned to gene-replication-count.
Do you disagree? (Do you expect the post-ASI future to be tiled with human DNA?)
Obviously Evolution doesn't actually have a utility function, and if it did, gene-replication-count is probably not it, as TekhneMakre points out. But, let's accept that for now, arguendo. ↩︎
I mostly agree.
I also think that impact is very unevenly distributed over people; the most impactful 5% of people probably account for >70% of the impact. [1]
And if so, then the difference in positive impact between {informing the top 5%} and {broadcasting to the field in general on the open Internet} is probably not very large. [2]
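A rough worked version of that arithmetic, as a sketch only: the 5% and 70% figures are the hand-wavey guesses above (with 70% taken as a lower bound), and the assumption that the benefit of sharing scales with the impact share of whoever is reached is added purely for illustration.

```python
# Guessed inputs from the comment above; not measurements.
top_fraction = 0.05        # share of people in the "most impactful" group
top_impact_share = 0.70    # lower-bound guess for their share of total impact

# Illustrative assumption: benefit of sharing is proportional to the
# impact share of the audience that is reached.
benefit_targeted = top_impact_share   # inform only the top 5%
benefit_broadcast = 1.0               # inform everyone
extra_from_broadcast = benefit_broadcast - benefit_targeted

# Per-person impact of the top group relative to everyone else.
rest_fraction = 1.0 - top_fraction
rest_impact_share = 1.0 - top_impact_share
per_person_ratio = (top_impact_share / top_fraction) / (rest_impact_share / rest_fraction)

print(f"extra benefit from broadcasting: {extra_from_broadcast:.2f} of the total")
print(f"per-person impact ratio, top group vs rest: about {per_person_ratio:.0f}x")
```

Under these assumptions, broadcasting adds at most about 30% of the total benefit while reaching an audience 20 times larger, which is the sense in which the difference is "not very large".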
Possibly also worth considering: Would (e.g.) writing a public post actually reach those few key people more effectively than (e.g.) sending a handful of direct/targeted emails? [3]
Talking about AI (alignment) here, but I think something like this applies in many fields. I don't have a good quantification of "impact" in mind, though, so this is very hand-wavey. ↩︎
Each approach has its downsides. The first approach requires identifying the relevant people, and is likely more effortful. The latter approach has the downside of putting potentially world-ending information in the hands of people who would use it to end the world (a bit sooner than they otherwise would). ↩︎
What is in fact the most effective way to reach whoever needs to be reached? (I don't know.) ↩︎
If {the reasoning for why AGI might not be near} comprises {a list of missing capabilities}, then my current guess is that the least-bad option would be to share that reasoning in private with a small number of relevant (and sufficiently trustworthy) people[1].
(More generally, my priors strongly suggest keeping any pointers to AGI-enabling capabilities private.)
E.g. the most capable alignment researchers who seem (to you) to be making bad strategic decisions due to not having considered {the reasoning for why AGI might not be near}. ↩︎
I can't critique your plan, because I can't parse your writing. My suggestion would be to put some effort into improving the clarity of your writing. [1]
Even basic things, such as the avoidance of long sentences, sometimes with side notes included and separated from the main sentence by commas, rather than e.g. em dashes, and making the scopes of various syntactic structures unambiguous, could go a long way towards making your text more legible. ↩︎
[...] bridge the "gap" between (less-precise proofs backed by advanced intuition) and (precise proofs simple enough for basically anyone to technically "follow").
Meta: Please consider using curly or square brackets ({} or []) for conceptual/grammatic grouping; please avoid overloading parentheses.
Thumbs up for trying to think of novel approaches to solving the alignment problem.
Every time the model does something that harms the utility function of the dumber models, it gets a loss function.
A few confusions:
- By "it gets a loss function", did you mean "it gets negative reward"?
- If yes, doesn't this plan consist entirely of reinforcement learning? How does this "emulate Evolution"?
- What exactly does the quoted sentence mean? Does the smarter model (S) receive RL signals proportional to... changes in the dumber agents' (D's) total utility?
Some problems, off the top of my head:
- GPT-like models don't have utility functions.
- Even if they did, mechinterp is nowhere near advanced enough to be able to reveal models' utility functions.
- Humans don't have utility functions. It's unclear how this would generalize to human-alignment.
- It's very much unclear what policy S would end up learning in this RL setup. It's even less clear how that policy would generalize outside of training.
  - If S is given reward proportional to (changes in) D's utility, then basically we're just training S with D's utility function. I.e., just training some arbitrary RL policy/agent. Not much to do with alignment, AFAICT. [1]
  - If S is instead given reward for things like {taking actions that lead to obtaining information about D's utility function}, then... we're training an RL policy/agent on proxies to "alignment". I expect that kind of approach to break down badly (due to Goodhart) when S becomes highly capable.
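One way to see the first sub-point, as a minimal sketch: assume S's per-step reward is exactly the change in D's total utility and that rewards are undiscounted (both assumptions, since the plan doesn't specify either). The utility values and the `delta_utility_rewards` helper are hypothetical.

```python
def delta_utility_rewards(utilities):
    """Per-step reward for S: change in D's total utility between consecutive states."""
    return [after - before for before, after in zip(utilities, utilities[1:])]


# Hypothetical values of D's total utility along one episode.
trajectory_utilities = [3.0, 5.0, 4.0, 9.0]

rewards = delta_utility_rewards(trajectory_utilities)
episode_return = sum(rewards)  # what S is trained to maximize
net_change = trajectory_utilities[-1] - trajectory_utilities[0]

# The sum of the deltas telescopes to the net change in D's utility.
print(episode_return == net_change)
```

Because the sum telescopes, maximizing this return is the same as maximizing D's final utility from a fixed start; S is simply an RL agent with D's utility function as its objective.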
I don't know how you arrived at this plan, but I'm guessing it involved reasoning with highly abstract and vague concepts. You might be interested in (i.a.) these tools/techniques:
- https://www.lesswrong.com/posts/GKfPL6LQFgB49FEnv/replace-the-symbol-with-the-substance
- https://www.lesswrong.com/posts/JcpzFpPBSmzuksmWM/the-5-second-level
Except maybe if you somehow managed to have the entire simulation be a very accurate model of the real world, and D's be very accurate models of humans. But that's not remotely realistic; and still subject to Goodhart. ↩︎
Tamsin Leake's project might match what you're looking for.
[...] iteratively align superintelligence.
To align the first automated alignment researcher, [...]
To validate the alignment of our systems, [...]
What do they mean by "aligned"?
How do we ensure AI systems much smarter than humans follow human intent?
OK. Assuming that
- sharp left turns are not an issue,
- and scalable oversight is even possible in practice,
- and OAI somehow solves the problems of
  - AIs hacking humans (to influence their intents),
  - and deceptive alignment,
  - humans going crazy when given great power,
  - etc.
- and all the problems no-one has noticed yet,
then, there's the question of "aligned to what"? Whose intent? What would success at this agenda look like?
Maybe: A superintelligence that accurately models its human operator, follows the human's intent[1] to complete difficult-but-bounded tasks, and is runnable at human-speed with a manageable amount of compute, sitting on OAI's servers?
Who would get to use that superintelligence? For what purpose would they use it? How long before the {NSA, FSB, CCP, ...} steal that superintelligence off OAI's servers? What would they use it for?
Point being: If an organization is not adequate in all key dimensions of operational adequacy, then even if they somehow miraculously solve the alignment/control problem, they might be increasing S-risks while only somewhat decreasing X-risks.
What is OAI's plan for getting their opsec and common-good-commitment to adequate levels? What's their plan for handling success at alignment/control?
and does not try to hack the human into having more convenient intents ↩︎
To what extent would you expect the government's or general populace's responses to "Robots with guns" to be helpful (or harmful) for mitigating risks from superintelligence? (Would getting them worried about robots actually help with x-risks?)
Right; that would be a silly thing to think.
My intended message might've been better worded as follows:
If staring into abysses is difficult/rough, then adequately staring into the darker abysses might require counter-intuitively large amounts of effort/agency. And yet, I think it might be necessary to grok those darker abysses, if we are to avoid falling into them. That makes me worried.
OTOH, you seem exceptionally reflective, so perhaps that worry is completely unfounded in your case. Anyway, I'm grateful for the work you do; I wish there were more people like you in this world. (Also, your attention is extremely valuable, so please feel free to promptly drop/forget this conversation.)
When people call things like this post "rough to write/read", and consider them to require a content warning, I wonder if most people are able to think clearly (or at all) about actually terrible scenarios, and worry that they aren't. (I'm especially worried if those people have influence in a domain where there might be a tradeoff between mitigating X-risks vs mitigating S-risks.)
I liked the description of the good future, though. Thanks for the reminder that things can (maybe) go well, too.
Thanks for the response.
To the extent that I understand your models here, I suspect they don't meaningfully bind/correspond to reality. (Of course, I don't understand your models at all well, and I don't have the energy to process the whole post, so this doesn't really provide you with much evidence; sorry.)
I wonder how one could test whether or not the models bind to reality? E.g. maybe there are case examples (of agents/people behaving in instrumentally rational ways) one could look at, and see if the models postdict the actual outcomes in those examples?
Other agents are not random though.
I agree. But AFAICT that doesn't really change the conclusion that fewer agents would tend to make the world more predictable/controllable. As you say yourself:
I don't think other agents are different from any other physical system that might be more-or-less chaotic, unpredictable or difficult to control.
And that was the weaker of the two apparent problems. What about the {implied self-preservation and resource acquisition} part?
Firstly, a bias towards choices which leave less up to chance.
Wouldn't this imply a bias towards eliminating other agents? (Since that would make the world more predictable, and thereby leave less up to chance?)
And thirdly, a bias towards choices which afford more choices later on.
Wouldn't this strongly imply biases towards both self-preservation and resource acquisition?
If the above two implications hold, then the conclusion
that the biases induced by instrumental rationality at best weakly support [...] that machine superintelligence is likely to lead to existential catastrophe
seems incorrect, no?
Could you briefly explain what is wrong with the reasoning above, or point me to the parts of the post that do so? (I only read the Abstract.)
[...] bad futures without extinction, e.g. that AI systems take over but don’t kill everyone.
What probability would you assign to humans remaining but not being able to kill themselves; i.e., to inescapable dystopias (vs. dystopias whose badness for any individual is bounded by death-by-suicide)?
This post raised some interesting points, and stimulated a bunch of interesting discussion in the comments. I updated a little bit away from foom-like scenarios and towards slow-takeoff scenarios. Thanks. For that, I'd like to upvote this post.
On the other hand: I think direct/non-polite/uncompromising argumentation against other arguments, models, or beliefs is (usually) fine and good. And I think it's especially important to counter-argue possible inaccuracies in key models that lots of people have about AI/ML/alignment. However, in many places, the post reads like a personal attack on a person (Yudkowsky), rather than just on models/beliefs he has promulgated.
I think that style of discourse runs a risk of
- politicizing the topic under discussion, and thereby making it harder for people to think clearly about it
- creating a shitty culture where people are liable to get personally attacked for participating in that discourse
For that, I'd like to downvote this post. (I ended up neither up- nor down-voting.)
I will bet any amount of money that GPT-5 will not kill us all.
What's the exchange rate for USD to afterlife-USD, though? Or what if they don't use currency in the afterlife at all? Then how would you pay the other party back if you lose?
Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?
If we're literally considering a universal quantifier over the set of all possible ML models, then I'd think it extremely likely that there does exist a simpler model with perplexity no worse than GPT-4. I'm confused as to how you (seem to have) arrived at the opposite conclusion.
Imagine an astronomer in the year 1600, who frequently refers to the "giant inscrutable movements" of the stars. [...]
I think the analogy to {building intelligent systems} is unclear/weak. There seem to be many disanalogies:
In the astronomy case, we have
- A phenomenon (the stars and their motions) that we cannot affect.
- That phenomenon is describable with simple laws.
- The "piles of numbers" are detailed measurements of that phenomenon.
- It is useful to take more measurements; doing so helps in finding the simple laws.
In the {building AIs} case, we have
- A phenomenon (intelligence) which we can choose to implement in different ways, and which we want to harness for some purpose.
- That phenomenon is probably not entirely implementable with simple programs. (As you point out yourself.)
- The "piles of numbers" are the means by which some people are implementing the thing; as opposed to measurements of the one and only way the thing is implemented.
- And so: implementing AIs as piles-of-numbers is not clearly helpful (and might be harmful) to finding better/simpler alternatives.
So, I predict with high confidence that any ML model that can reach the perplexity levels of Transformers will also present great initial interpretive difficulty.
I do agree that any realistic ML model that achieves GPT-4-level perplexity would probably have to have at least some parts that are hard to interpret. However, I believe it should (in principle) be possible to build ML systems that have highly interpretable policies (or analogues thereof), despite having hard-to-interpret models.
I think if our goal was to build understandable/controllable/safe AI, then it would make sense to factor the AI's mind into various "parts", such as e.g. a policy, a set of models, and a (set of sub-)goals.
In contrast, implementing AIs as giant Transformers precludes making architectural distinctions between any such "parts"; the whole AI is in a(n architectural) sense one big uniform soup. Giant Transformers don't even have the level of modularity of biological brains designed by evolution.
Consequently, I still think the "giant inscrutable tensors"-approach to building AIs is terrible from a safety perspective, not only in an absolute sense, but also in a relative sense (relative to saner approaches that I can see).
scale up the experiment of Pretraining from Human Feedback by using larger data
AFAICT, PHF doesn't solve any of the core problems of alignment. IIUC, PHF is still using an imperfect reward model trained on a finite amount of human signals-of-approval; I'd tentatively expect scaling up PHF (to ASI) to result in death-or-worse by Goodhart. Haven't thought about PHF very thoroughly though, so I'm uncertain here.
we can even try to design a data set such that it uses words like freedom, justice, alignment and more value laden words
Did you mean something like "(somehow) design a data set such that, in order to predict token-sequences in that data set, the AI has to learn the real-world structure of things we care about, like freedom, justice, alignment, etc."? [1]
can only learn legitimate generalizations, not deceptive generalizations leading to deceptive alignment
I don't understand this. What difference are you pointing at with "deceptive" vs "legitimate" generalizations? How does {AI-human (and/or AI-env) interactions being limited to a simple interface} preclude {learning "deceptive" generalizations}?
I'm under the impression that entirely "legitimate" generalizations can (and a priori probably will) lead to "deception"; see e.g. https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness. Do you disagree with that? (If yes, how?)
can't amplify Goodhart
Side note: I don't understand what you mean by this (in the given context).
can't [...] hack the human's values
I don't see how this follows. IIUC, the proposition here is something like
- If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can't hack the humans.
Is that a reasonable representation of what you're saying? If yes, consider: What if we replace "the AI" with "Anonymous" and "the humans" with "the web server"? Then we get
- If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can't hack the web server
...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn't rely on hardware effects like rowhammering.
(I agree that limiting human-AI interactions to a simple interface would be helpful, but I think it's far from sufficient (to guarantee any form of safety).)
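To make the web-server analogy concrete, here is a minimal sketch of a system whose whole interface is "one string in, one string out" and which is still exploitable. Everything in it (the `ToyServer` class, the file names, the contents) is hypothetical and made up for illustration; the bug is an ordinary path-traversal flaw, not a claim about any real server.

```python
class ToyServer:
    """Hypothetical toy server; its entire interface is handle(request) -> str."""

    def __init__(self) -> None:
        # Made-up contents: one file meant to be public, one meant to stay private.
        self._files = {
            "public/hello.txt": "hello",
            "private/key.txt": "s3cret",
        }

    def handle(self, request: str) -> str:
        # Intended policy: only ever serve files under public/.
        parts = []
        for part in ("public/" + request).split("/"):
            if part == "..":
                if parts:
                    parts.pop()  # Bug: ".." is allowed to climb out of public/.
            elif part and part != ".":
                parts.append(part)
        return self._files.get("/".join(parts), "404")


server = ToyServer()
intended = server.handle("hello.txt")            # normal use of the interface
exploited = server.handle("../private/key.txt")  # same interface, private file leaks
```

The interface never changed between the two calls; the exploit lives entirely in how the system behind the interface processes its input.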
IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier. I'm still confused as to why you think that. To the extent that I understood the reasons you presented, I think they're incorrect (as outlined above). (Maybe I'm misunderstanding something.)
I'm kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!
I think a naively designed data set containing lots of {words that are value-laden for English-speaking humans} would not cut it, for hopefully obvious reasons. ↩︎
I don't quite understand what you're saying; I get the impression we're using different ontologies/vocabularies. I'm curious to understand your model of alignment, and below are a bunch of questions. I'm uncertain whether it's worth the time to bridge the conceptual gap, though --- feel free to drop this conversation if it feels opportunity-costly.
(1.)
Are you saying that if we assumed agents to be Cartesian[1], then you'd know a solution to the problem of {how could a weak agent train and align a very powerful agent}? If yes, could you outline that solution?
(2.)
Resolve the embedded alignment problems [...] This is essentially Reinforcement Learning from Human Feedback's method for alignment
How does RLHF solve problems of embedded alignment? I'm guessing you're referring to something other than the problems outlined in Embedded Agency?
(3.)
What exact distinction do you mean by "online" vs "offline"? Given that any useful/"pivotal" AI would need to learn new things about the world (and thus, modify its own models/mind) in order to form and execute a useful/pivotal plan, it would have to learn "online", no?
(4.)
the data set used for human values
What kind of data set did you have in mind here? A data set s.t. training an AI on it in some appropriate way would lead to the AI being aligned to human values? Could you give a concrete example of such a data set (and training scheme)?
e.g. software agents in some virtual world, programmed such that agents are implemented as some well-defined datatype/class, and agent-world interactions can only happen via a well-defined simple interface, running on a computer that cannot be hacked from within the simulation. ↩︎
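A minimal sketch of what such a "Cartesian" setup could look like in code, assuming a one-dimensional toy world. All names here (`Action`, `Observation`, `World`, `always_right`) are hypothetical. Note that Python itself does not enforce the boundary (the "cannot be hacked from within the simulation" part is assumed, not implemented); the sketch only shows the structure of agents as a well-defined type interacting through a single narrow interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    LEFT = -1
    STAY = 0
    RIGHT = 1


@dataclass(frozen=True)
class Observation:
    position: int


# An agent is just a function from observation to action; it holds no
# reference to the world object.
Agent = Callable[[Observation], Action]


class World:
    def __init__(self, size: int = 5) -> None:
        self._size = size
        self._position = 0

    def step(self, agent: Agent) -> Observation:
        # The only agent-world interaction: observation in, action out.
        action = agent(Observation(self._position))
        if not isinstance(action, Action):
            raise TypeError("agents may only return an Action")
        # Apply the action, keeping the position inside the world's bounds.
        self._position = min(max(self._position + action.value, 0), self._size - 1)
        return Observation(self._position)


def always_right(obs: Observation) -> Action:
    return Action.RIGHT


world = World(size=3)
trace = [world.step(always_right).position for _ in range(4)]
```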
the more we crank up generalization ability, the better it's alignment
To me that seems almost correct, in a way that is dangerous. I'd agree with the statement that
the more we crank up generalization ability, the better the AI's ability to align to any given set of values/goals
But for that to lead to the AI being aligned with "good" values, we also need to somehow get the AI to choose/want to align with "good" values. (Whatever "good" even means; maybe humanity's CEV?) And that does not happen on its own, AFAICT.
I think it might be relevant to note here that it's not really humans who are building current SOTA AIs --- rather, it's some optimizer like SGD that's doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky's claim that it's hard/inefficient to build something without understanding it.
If Eliezer wasn't meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then [...]
I think it's important to distinguish between
- Scaling up a neural network, and running some kind of fixed algorithm on it.
- Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms.
IIUC, in Artificial Mysterious Intelligence, Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don't know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so... *shrugs*.
- A task X reduces to task Y if and only if...
Shouldn't X and Y be the other way around there?
I feel like this conversation might be interesting to continue, if I had more bandwidth, but I don't. In any case, thanks for the linked article, looks interesting based on the abstract.
Whatever happened here is an interesting datapoint about [...]
I think using the word "interesting" here is kinda assuming the conclusion?
Whatever happened here is a datapoint about the long-term evolution of thermodynamic systems away from equilibrium.
Pretty much all systems in the universe can be seen as "thermodynamic systems". And for a system to evolve at all, it necessarily has to be away from equilibrium. So it seems to me that that sentence is basically saying
"Whatever happened here is a datapoint about matter and energy doing their usual thing over a long period of time."
And... I don't see how that answers the question "why would an ASI find it interesting?"
From the biological anchors paper [...] the point is that what happened here is not so trivial or boring that its clear that an ASI would not have any interest in it.
I agree that a lot of stuff has happened. I agree that accurately simulating the Earth (or even just the biological organisms on Earth) is not trivial. What I don't see (you making an actual argument for) is why all those neural (or other) computations would be interesting to an ASI. [1]
I'm sure people have written more extensively about this, about an ASI freezing some selection of the human population for research purposes or whatever.
Right. That sounds like a worse-than-death scenario. I agree those are entirely plausible, albeit maybe not the most likely outcomes. I'd expect those to be caused by the AI ending up with some kind of human-related goals (due to being trained with objectives like e.g. "learn to predict human-generated text" or "maximize signals of approval from humans"), rather than by the ASI spontaneously developing a specific interest in the history of how natural selection developed protein-based organic machines on one particular planet.
I just find the idea that the ASI will want my atoms for something trivial, when [...]
As mentioned above, I'd agree that there's some chance that an Earth-originating ASI would end up with a goal of "farming" (simulated) humans for something (e.g. signals of approval), but I think such goals are unlikely a priori. Why would an ASI be motivated by "a grand exploration of the extremes of thermodynamics" (whatever that even means)? (Sounds like a waste of energy, if your goal is to (e.g.) maximize the number of molecular squiggles in existence.) Are you perhaps typical-minding/projecting your own (laudable) human wonder/curiosity onto a hypothetical machine intelligence?
Analogy: If you put a few kilograms of fluid in a box, heat it up, and observe it for a few hours, the particles will bop around in really complicated ways. Simulating all those particle interactions would take a huge amount of computation; it would be highly non-trivial. And yet, water buckets are not particularly exciting or interesting. Complexity does not imply "interestingness". ↩︎
Why would an ASI be interested in the Earth's ecosystems?
Why not both? Why leave value lying around? (Also, the asteroid belt containing Ceres and Vesta contains several orders of magnitude less matter than Earth. Maybe you meant "why not go colonize the Milky Way and other galaxies"?)
I agree with some parts of what (I think) you're saying; but I think I disagree with a lot of it. My thoughts here are still blurry/confused, though; will need to digest this stuff further. Thanks!
I really wish there was an "agree/disagree" button for posts. I'd like to upvote this post (for epistemic virtue / presenting reasonable "contrarian" views and explaining why one holds them), but I also strongly disagree with the conclusions and suggested policies. (I ended up voting neither up nor down.)
EDIT: After reading Akash's comments, and re-reading the post more carefully: I largely agree with Akash (and updated towards thinking that my standards for "epistemic virtue" are/were too low).
Thanks for the thoughtful answer. After thinking about this stuff for a while, I think I've updated a bit towards thinking (i.a.) that alignment might not be quite as difficult as I previously believed.[1]
I still have a bunch of disagreements and questions, some of which I've written below. I'd be curious to hear your thoughts on them, if you feel like writing them up.
I.
I think there's something like an almost-catch-22 in alignment. A simplified caricature of this catch-22:
- In order to safely create powerful AI, we need it to be aligned.
- But in order to align an AI, it needs to have certain capabilities (it needs to be able to understand humans, and/or to reason about its own alignedness, etc.).
- But an AI with such capabilities would probably already be dangerous.
Looking at the paragraphs quoted below,
I now realize that I have an additional assumption that I didn't explicitly put in the post, which is something like... alignment and capabilities may be transmitted simultaneously. [...]
[...] a successive series of more sophisticated AI systems that get gradually better at understanding human preferences and being aligned with them (the way we went from GPT-1 to GPT-3.5)
I'm pattern-matching that as proposing that the almost-catch-22 is solvable by iteratively
- 1.) incrementing the AI's capabilities a little bit
- 2.) using those improved capabilities to improve the AI's alignedness (to the extent possible); goto (1.)
Does that sound like a reasonable description of what you were saying? If yes, I'm guessing that you believe sharp left turns are (very) unlikely?
I'm currently quite confused/uncertain about what happens (notably: whether something like sharp left turns happen) when training various kinds of (simulator-like) AIs to very high capabilities. But I feel kinda pessimistic about it being practically possible to implement and iteratively interleave steps (1.)-(2.) so as to produce a powerful aligned AGI. If you feel optimistic about it, I'm curious as to what makes you optimistic.
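For concreteness, the loop in steps (1.)-(2.) can be written as a skeleton. This is only a toy sketch: `interleave`, the two step functions, and all numbers are made up, and "capability" and "alignment" are reduced to single floats purely so the loop can run. The point is just to show where the crux sits, namely in whether step (2.) keeps pace with step (1.).

```python
from typing import Callable, List, Tuple


def interleave(
    capability_step: Callable[[float], float],
    alignment_step: Callable[[float, float], float],
    rounds: int,
) -> List[Tuple[float, float]]:
    """Run step (1.), then step (2.), then goto (1.), for a fixed number of rounds."""
    capability, alignment = 0.0, 0.0
    history = []
    for _ in range(rounds):
        capability = capability_step(capability)           # step (1.)
        alignment = alignment_step(capability, alignment)  # step (2.)
        history.append((capability, alignment))
    return history


def align(capability: float, alignment: float) -> float:
    # Toy assumption: alignment can improve by at most 1.5 per round,
    # and never beyond the current capability level.
    return min(capability, alignment + 1.5)


# Gradual capability gains: alignment keeps up in this toy model.
smooth = interleave(lambda c: c + 1.0, align, rounds=5)

# Sudden capability gains (a crude stand-in for a sharp left turn): it does not.
jumpy = interleave(lambda c: 3.0 * c + 1.0, align, rounds=5)

smooth_gaps = [c - a for c, a in smooth]
jumpy_gaps = [c - a for c, a in jumpy]
```

In the toy model the gap stays at zero under gradual gains and grows every round under sudden ones; whether real training looks more like the first or the second case is exactly the open question.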
II.
I don't know how exactly we get from the part where the AI is modeling the human to the part where it actively wants to fulfill the human's preferences [...]
I think {the stuff that paragraph points at} might contain a crux or two.
One possible crux:
[...] that drive shows up in human infants and animals, so I'd guess that it wouldn't be very complicated. [2]
Whereas I think it would probably be quite difficult/complicated to specify a good version of D := {(a drive to) fulfill human H's preferences, in a way H would reflectively endorse}. Partly because humans are inconsistent, manipulable, and possessed of rather limited powers of comprehension; I'd expect the outcome of an AI optimizing for D to depend a lot on things like
- just what kind of reflective process the AI was simulating H to be performing
- in what specific ways/order did the AI fulfill H's various preferences
- what information/experiences the AI caused H to have, while going about fulfilling H's preferences.
I suspect one thing that might be giving people the intuition that {specifying a good version of D is easy} is the fact that {humans, dogs, and other agents on which people's intuitions about PF are based} are very weak optimizers; even if such an agent had a flawed version of D, the outcome would still probably be pretty OK. But if you subject a flawed version of D to extremely powerful optimization pressure, I'd expect the outcome to be worse, possibly catastrophic.
And then there's the issue of {how do you go about actually "programming" a good version of D in the AI's ontology, and load that program into the AI, as the AI gains capabilities?}; see (I.).
Maybe something like it being rewarding if the predictive model finds actions that the adult/human seems to approve of.
I think this would run into all the classic problems arising from {rewarding proxies to what we actually care about}, no? Assuming that D := {understand human H's preferences-under-reflection (whatever that even means), and try to fulfill those preferences} is in fact somewhat complex (note that crux again!), it seems very unlikely that training an AI with a reward function like R := {positive reward whenever H signals approval} would generalize to D. There seem to be much, much simpler (a priori more likely) goals Y to which the AI could generalize from R. E.g. something like Y := {maximize number of human-like entities that are expressing intense approval}.
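A toy illustration of that worry, with entirely made-up outcomes and numbers: on the options available to a weak optimizer, maximizing the proxy R picks the same outcome as maximizing D, but once a more extreme option becomes reachable, the two come apart. The `Outcome` type and the scores are hypothetical stand-ins, not a model of any real training setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    preferences_fulfilled: float  # stand-in for D (what we actually care about)
    approval_signals: float       # stand-in for what R observes


def reward_R(outcome: Outcome) -> float:
    return outcome.approval_signals


def goal_D(outcome: Outcome) -> float:
    return outcome.preferences_fulfilled


# Options reachable by a weak optimizer: R and D rank them identically.
weak_options = [
    Outcome("do nothing", 0.0, 0.0),
    Outcome("help a bit", 1.0, 1.0),
    Outcome("help a lot", 3.0, 3.0),
]

# A strong optimizer can also reach an outcome like Y.
strong_options = weak_options + [
    Outcome("mass-produce approving human-like entities", -10.0, 1000.0),
]

best_weak = max(weak_options, key=reward_R)
best_strong = max(strong_options, key=reward_R)
```

Here `best_weak` is also the best option by `goal_D`, while `best_strong` is the worst one; nothing about R's behaviour on the weak options reveals the difference.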
III.
Side note: I'm weirded out by all the references to humans, raising human children, etc. I think that kind of stuff is probably not practically relevant/useful for alignment; and that trying to align an ASI by... {making it "human-like" in some way} is likely to fail, and probably has poor hyperexistential separation to boot. I don't know if anyone's interested in discussing that, though, so maybe it's better that I don't waste more space on that here. (?)
To put some ass-numbers on it, I think I'm going from something like
- 85% death (or worse) from {un/mis}aligned AI
- 10% some humans align AI and become corrupted, Weak Dystopia ensues
- 5% AI is aligned by humans with something like security mindset and cosmopolitan values, things go very well
to something like
- 80% death (or worse) from {un/mis}aligned AI
- 13% some humans align AI and become corrupted, Weak Dystopia ensues
- 7% AI is aligned by humans with something like security mindset and cosmopolitan values, things go very well
Side note: The reasoning step {drive X shows up in animals} -> {X is probably simple} seems wrong to me. Like, Evolution is stupid, yes, but it's had a lot of time to construct prosocial animals' genomes; those genomes (and especially the brains they produce in interaction with huge amounts of sensory data) contain quite a lot of information/complexity. ↩︎
Interesting!
Those complicated things are only complicated because we don't introspect about them hard enough, not for any intrinsic reasons.
My impression is that the human brain is in fact intrinsically quite complex![1]
I also think most people just don't have enough self-awareness to be able to perceive their thoughts forming and get a sense of the underlying logic.
I think {most people's introspective abilities} are irrelevant. (But FWIW, given that lots of people seem to e.g. conflate a verbal stream with thought, I agree that median human introspective abilities are probably kinda terrible.)
Consider all the psychological wisdom of Buddhism [...]
Unfortunately I'm not familiar with the wisdom of Buddhism; so that doesn't provide me with much evidence either way :-/
An obvious way to test how complex a thing X really is, or how well one understands it, is to (attempt to) implement it as code or math. If the resulting software is not very long, and actually captures all the relevant aspects of X, then indeed X is not very complex.
Are you able to write software that implements (e.g.) kindness, prosociality, or "an entity that cares intrinsically about other entities"[2]? Or write an informal sketch of such math/code? If yes, I'd be very curious to see it! [3]
Like, even if only ~1% of the information in the human genome is about how to wire the human brain, that'd still be ~10 MB worth of info/code. And that's just the code for how to learn from vast amounts of sensory data; an adult human brain would contain vastly more structure/information than that 10 MB. I'm not sure how to estimate how much, but given the vast amount of "training data" and "training time" that goes into a human child, I wouldn't be surprised if it were in the ballpark of hundreds of terabytes. If even 0.01% of that info is about kindness/prosociality/etc., then we're still talking of something like 10 GB worth of information. This (and other reasoning) leads me to feel moderately sure that things like "kindness" are in fact rather complex. ↩︎
...and hopefully, in addition to "caring about other entities", also tries to do something like "and implement the other entities' CEV". ↩︎
Please don't publish anything infohazardous, though, obviously. ↩︎
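The back-of-envelope numbers in the first footnote above can be reproduced in a few lines. The inputs that are not in the footnote are assumptions: a genome of roughly 3.1 billion base pairs at 2 bits per base pair, and 100 TB as one representative value for "hundreds of terabytes".

```python
# Assumptions (not from the footnote): genome size and encoding density.
BASE_PAIRS = 3.1e9          # approximate human genome length
BITS_PER_BASE_PAIR = 2      # four possible bases -> 2 bits each

genome_bytes = BASE_PAIRS * BITS_PER_BASE_PAIR / 8

# "~1% of the information in the human genome is about how to wire the brain"
brain_wiring_bytes = 0.01 * genome_bytes

# Assumption: take 100 TB as a stand-in for "hundreds of terabytes".
adult_brain_bytes = 100e12

# "If even 0.01% of that info is about kindness/prosociality/etc."
kindness_bytes = 0.0001 * adult_brain_bytes

print(f"genome:       {genome_bytes / 1e6:.0f} MB")
print(f"brain wiring: {brain_wiring_bytes / 1e6:.2f} MB")
print(f"kindness:     {kindness_bytes / 1e9:.0f} GB")
```

Under these assumptions the genome comes to about 775 MB, so 1% of it is a little under 8 MB (consistent with "~10 MB" as an order of magnitude), and 0.01% of 100 TB is 10 GB.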