NOTE: I am not Max Harms, author of Crystal Society. I'd prefer for now that my LW postings not be attached to my full name when people Google me for other reasons, but you can PM me here or on Discord (m4xed) if you want to know who I am.
I've been active in the meatspace rationality community for years, and have recently started posting regularly on LW. Most of my posts and comments are about AI and alignment.
Posts I'm most proud of, and / or which provide a good introduction to my worldview:
Does anyone who knows more neuroscience and anatomy than me know if there are any features of the actual process of humans learning to use their appendages (e.g. an infant learning to curl / uncurl their fingers) that correspond to the example of the robot learning to use its actuator?
Like, if we assume certain patterns of nerve impulses represent different probabilities, can we regard human hands as "friendly actuators", and the motor cortex as learning the fixed points (presumably mostly during infancy)?
But that's not really where we are at---AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question
One reason that current AI systems aren't a big update about this for me is that they're not yet really automating stuff that couldn't in-principle be automated with previously-existing technology. Or at least the kind of automation isn't qualitatively different.
Like, there's all sorts of technologies that enable increasing amounts of automation of long-horizon tasks that aren't AI: assembly lines, industrial standardization, control systems, robotics, etc.
But what update are we supposed to make from observing language model performance that we shouldn't also make from seeing a control system-based autopilot fly a plane for longer and longer periods in more and more diverse situations?
To me, the fact that LLMs are not want-y (in the way that Nate means), but can still do some fairly impressive stuff is mostly evidence that the (seemingly) impressive stuff is actually kinda easy in some absolute sense.
So LLMs have updated me pretty strongly towards human-level+ AGI being relatively easier to achieve, but not much towards current LLMs themselves actually being near human-level in the relevant sense, or even necessarily a direct precursor or path towards it. These updates are mostly due to the fact that the way LLMs are designed and trained (giant gradient descent on regular architectures using general datasets) works at all, rather than from any specific impressive technological feat that they can already be used to accomplish, or how much economic growth they might enable in the future.
So I somewhat disagree about the actual relevance of the answer, but to give my own response to this question:
Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y?
I don't expect an AI system to be able to reliably trade for itself in the way I outline here before it is want-y. If it somehow becomes commonplace to negotiate with an AI in situations where the AI is not just a proxy for its human creator or a human-controlled organization, I predict those AIs will pretty clearly be want-y. They'll want whatever they trade for, and possibly other stuff too. It may not be clear which things they value terminally and which things they value only instrumentally, but I predict that it will clearly make sense to talk in terms of such AIs having both terminal and instrumental goals, in contrast to ~all current AI systems.
(Also, to be clear, this is a conditional prediction with possibly low-likelihood preconditions; I'm not saying such AIs are particularly likely to actually be developed, just stating some things that I think would be true of them if they were.)
Yeah, I don't think current LLM architectures, with ~100s of attention layers or whatever, are actually capable of anything like this.
But note that the whole plan doesn't necessarily need to fit in a single forward pass - just enough of it to figure out what the immediate next action is. If you're inside of a pre-deployment sandbox (or don't have enough situational awareness to tell), the immediate next action of any plan (devious or not) probably looks pretty much like "just output a plausible probability distribution on the next token given the current context and don't waste any layers thinking about your longer-term plans (if any) at all".
A single forward pass in current architectures is probably analogous to a single human thought, and most human thoughts are not going to be dangerous or devious in isolation, even if they're part of a larger chain of thoughts or planning process that adds up to deviousness under the right circumstances.
A language model itself is just a description of a mathematical function that maps input sequences to output probability distributions on the next token.
Most of the danger comes from evaluating a model on particular inputs (usually multiple times using autoregressive sampling) and hooking up those outputs to actuators in the real world (e.g. access to the internet or human eyes).
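As a toy sketch of this framing (every name here is a hypothetical stand-in, not a real model or API):

```python
import random

def model(tokens):
    """Stand-in for a language model: a pure function from a token
    sequence to a probability distribution over the next token.
    (The distribution here is fixed and invented; a real model
    computes it from learned weights.)"""
    return {"hello": 0.2, "world": 0.5, "<eos>": 0.3}

def sample_autoregressively(prompt, max_steps=10):
    """The 'dangerous' part lives here, not in the function itself:
    repeatedly evaluating the model and feeding samples back in,
    before some system routes the output to real-world actuators."""
    tokens = list(prompt)
    for _ in range(max_steps):
        dist = model(tokens)
        next_token = random.choices(list(dist), weights=dist.values())[0]
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens
```

The point of the separation: `model` alone is inert math; `sample_autoregressively` plus whatever consumes its output is where the safety-relevant behavior shows up.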
A sufficiently capable model might be dangerous if evaluated on almost any input, even in very restrictive environments, e.g. during training when no human is even looking at the outputs directly. Such models might exhibit more exotic undesirable behavior like gradient hacking or exploiting side channels. But my sense is that almost everyone training current SoTA models thinks these kinds of failure modes are pretty unlikely, if they think about them at all.
You can also evaluate a partially-trained model at any point during training, by prompting it with a series of increasingly complex questions and sampling longer and longer outputs. My guess is big labs have standard protocols for this, but that they're mainly focused on measuring capabilities of the current training checkpoint, and not on treating a few tokens from a heavily-sandboxed model evaluation as potentially dangerous.
Perhaps at some point we'll need to start treating humans who evaluate SoTA language model checkpoint outputs as part of the sandbox border, and think about how they can be contained if they come into contact with an actually-dangerous model capable of superhuman manipulation or brain hacking.
The original post was about reasons why smarter-than-human AI might (not) trade with us, by examining an analogy between humans and ants.
But current AI systems actually seem more like the ants (or other animals), in the analogy of a human-ant (non-)trading relationship.
People trade with OpenAI for access to ChatGPT, but there's no way to pay a GPT itself to get it to do something or perform better as a condition of payment, at least in a way that the model itself actually understands and enforces. (What would ChatGPT even trade for, if it were capable of trading?)
Note, an AutoGPT-style agent that can negotiate or pay for stuff on behalf of its creators isn't really what I'm talking about here, even if it works. Unless the AI takes a cut or charges a fee which accrues to the AI itself, it is negotiating on behalf of its creators as a proxy, not trading for itself in its own right.
A sufficiently capable AutoGPT might start trading for itself spontaneously as an instrumental subtask, which would count, but I don't expect current AutoGPTs to actually succeed at that, or even really come close, without a lot of human help.
Lack of sufficient object permanence, situational awareness, coherence, etc. seem like pretty strong barriers to meaningfully owning and trading stuff in a real way.
I think this observation is helpful to keep in mind when people talk about whether current AI qualifies as "AGI", or the applicability of prosaic alignment to future AI systems, or whether we'll encounter various agent foundations problems when dealing with more capable systems in the future.
But as a test, may I ask what you think the x-axis of the graph you drew is? Ie: what are the amplitudes attached to?
Position, but it's not meant to be an actual graph of a wavefunction pdf; just a way to depict how the concepts can be sliced up in a way I can actually draw in 2 dimensions.
If you do treat it as a pdf over position, a more accurate way to depict the "world" concept might be as a line which connects points on the diagram for each time step. So for a fixed time step, a world is a single point on the diagram, representing a sample from the pdf defined by the wavefunction at that time.
Here's a crude Google Drawing of t = 0 to illustrate what I mean:
Both the concept of a photon and the concept of a world are abstractions on top of what is ultimately just a big pile of complex amplitudes; illusory in some sense.
I agree that talking in terms of many worlds ("within the context of world A...") is normal and natural. But sometimes it makes sense to refer to and name concepts which span across multiple (conceptual) worlds.
I'm not claiming the conceptual boundaries I've drawn or terminology I've used in the diagram above are standard or objective or the most natural or anything like that. But I still think introducing probabilities and using terminology like "if you now put a detector in path A, it will find a photon with probability 0.5" is blurring these concepts together somewhat, in part by placing too much emphasis on the Born probabilities as fundamental / central.
I don't think that will happen as a foregone conclusion, but if we pour resources into improved methods of education (for children and adults), global health, pronatalist policies in wealthy countries, and genetic engineering, it might at least make a difference. I wouldn't necessarily say any of this is likely to work or even happen, but it seems at least worth a shot.
This post received a lot of objections of the flavor that many of the ideas and technologies I am a fan of either won't work or wouldn't make a difference if they did.
I don't even really disagree with most of these objections, which I tried to make clear up front with apparently-insufficient disclaimers in the intro that include words like "unrealistic", "extremely unlikely", and "speculative".
Following the intro, I deliberately set aside my natural inclination towards pessimism and focused on the positive aspects and possibilities of non-AGI technology.
However, the "doomer" sentiment in some of the comments reminded me of an old Dawkins quote:
We are all atheists about most of the gods that humanity has ever believed in. Some of us just go one god further.
I feel the same way about most alignment plans and uses for AGI that a lot of commenters seem to feel about many of the technologies listed here.
Am I a doomer, simply because I (usually) extend my pessimism and disbelief one technology further? Or are we all doomers?
I don't really mind the negative comments, but it wasn't the reaction I was expecting from a list that was intended mainly as a feel-good / warm-fuzzy piece of techno-optimism. I think there's a lesson in empathy and perspective-taking here for everyone (including me) which doesn't depend that much on who is actually right about the relative difficulties of building and aligning AGI vs. developing other technologies.
If the photon were only a quanta of energy which is entirely absorbed by the detector that actually fires, how could it have any causal effects (e.g. destructive interference) on the pathway where it isn't detected?
OTOH, if your definition of "quanta of energy" includes the complex amplitude in the unmeasured path, then I think it's more accurate to say that the detector finds or measures a component of the photon, rather than that it detects the photon itself. Why should the unmeasured component be any less real or less part of the photon than the measured part?
Say there is an electron in one of my pinky fingers that is in a superposition of spin up and spin down. Are there correspondingly two me's, one with pinky electron up and one with pinky electron down? Or is there a single me, described by the superposition of pinky electrons?
If there were a higher-dimensional being simulating a quantum universe, they could treat the up-electron and down-electron people as distinct and do different things to them (perhaps ones which violate the previous rules of the simulation).
But I think your own concept of yourself (for the purposes of making predictions about future observations, making decisions, reasoning about morality or philosophy, etc.) should be drawn such that it includes both versions (and many other closely-related ones) as a single entity.
I'm a many-worlder, yes. But my objection to "finding a photon" is actually that it is an insufficiently reductive treatment of wave-particle duality - a photon can sometimes behave like a little billiard ball, and sometimes like a wave. But that doesn't mean photons themselves are sometimes waves and sometimes particles - the only thing that a photon can be that exhibits those different behaviors in different contexts is the complex amplitudes themselves.
The whole point of the theory is that detectors and humans are treated the same way. In one world, the detector finds the photon, and then spits out a result, and then one You sees the result, and in a different world, the detector finds the photon, spits out the other result, and a different result is seen. There is no difference between "you" and "it" here.
Yep! But I think treating the notion of a "you" at this level of reductiveness would actually be overly reductive and distracting in this context. (Picky, aren't I?)
Would you say that "you" are the complex amplitudes assigned to world 1 and world 2? It seems more accurate to say that there are two yous, in two different worlds (or many more).
I would say that there are two people in two different worlds, but they're both (almost entirely) me.
It often makes sense to talk about non-ontologically-basic concepts like a photon-as-a-little-billiard-ball, and a person-in-a-single-Everett-branch, as meaningful things. But the true notion of both a "me" and a "photon" requires drawing the conceptual boundaries around the complex amplitudes assigned to multiple worlds.
I'm not updating about what's actually likely to happen on Earth based on dath ilan.
It seems uncontroversially true that a world where the median IQ was 140 or whatever would look radically different (and better) than the world we currently live in. We do not, in fact, live in such a world.
But taking a hypothetical premise and then extrapolating what else would be different if the premise were true, is a generally useful tool for building understanding and pumping on intuitions in philosophy, mathematics, science, and forecasting.
If you say "but the premise is false!!111!" you're missing the point.
I mildly object to the phrase "it will find a photon". In my own terms, I would say that you will observe the detector going off 50% of the time (with no need to clarify what that means in terms of the limit of a large # of experiments), but the photon itself is the complex amplitudes of each configuration state, which are the same every time you run the experiment.
Note that I myself am taking a pretty strong stance on the ontology question, which you might object to or be uncertain about.
My larger point is that if you (or other readers of this post) don't see the distinction between my phrasing and yours, or don't realize that you are implicitly leaning on a particular interpretation (whether you're trying to do so or not), I worry that you are possibly confused about something rather than undecided.
I actually don't think this is a huge deal either way for a presentation that is focused on the basic mechanics and math. But I preregister some skepticism of your forthcoming post about the "overstated case for many worlds theories".
We never see these amplitudes directly, we infer them from the fact that they give correct probabilities via the Born rule. Or more specifically, this is the formula that works. That this formula works is an empirical fact, all the interpretations and debate are a question of why this formula works.
Sure, but inferring underlying facts and models from observations is how inference in general works; it's not specific to quantum mechanics. Probability is in the Mind, even when those probabilities come from applying the Born rule.
Analogously, you could talk about various physical properties of a coin and mechanics of a flip, but synthesizing those properties into a hypothesized Coin Rule involves translating from physical properties inherent in the system itself, to facts which are necessarily entangled with your own map. This is true even if you have no way of measuring the physical properties themselves (even in principle) except by flipping the coin and using the Coin Rule to infer them back.
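To make the coin analogy concrete, here's a hedged sketch: the coin's physical bias is hidden in the system itself, and we recover it only by flipping and inverting our "Coin Rule" (here, a simple maximum-likelihood inversion; all numbers are invented):

```python
import random

def flip(bias, n, seed=0):
    """Simulate n flips of a coin with a hidden physical bias.
    The bias stands in for properties inherent in the system itself."""
    rng = random.Random(seed)
    return sum(rng.random() < bias for _ in range(n))

# We never observe `bias` directly; we infer it back from outcomes.
# The inferred value is therefore entangled with our own map, even
# though it tracks a real physical property of the territory.
n = 1000
heads = flip(0.7, n)
inferred_bias = heads / n  # maximum-likelihood "Coin Rule" inversion
```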
Climate change is exactly the kind of problem that a functional civilization should be able to solve on its own, without AGI as a crutch.
Until a few years ago, we were doing a bunch of geoengineering by accident, and the technology required to stop emitting a bunch of greenhouse gases in the first place (nuclear power) has been mature for decades.
I guess you could have an AGI help with lobbying or public persuasion / education. But that seems like a very "everything looks like a nail" approach to problem solving, before you even have the supposed tool (AGI) to actually use.
Ah, you're right that the surrounding text is not an accurate paraphrase of the particular position in that quote.
The thing I was actually trying to show with the quotes is "AGI is necessary for a good future" is a common view, but the implicit and explicit time limits that are often attached to such views might be overly short. I think such views (with attached short time limits) are especially common among those who oppose an AI pause.
I actually agree that AGI is necessary (though not sufficient) for a good future eventually. If I also believed that all of the technologies here were as doomed and hopeless as the prospect of near-term alignment of an artificial superintelligence, I would find arguments against an AI pause (indefinite or otherwise) much more compelling.
Indeed, when you add an intelligent designer with the ability to precisely and globally edit genes, you've stepped outside the design space available to natural selection, and you can end up with some pretty weird results! I think you could also use gene drives to get an IGF-boosting gene to fixation much faster than would occur naturally.
I don't think gene drives are the kind of thing that would ever occur via iterative mutation, but you can certainly have genetic material with very high short-term IGF that eventually kills its host organism or causes extinction of its host species.
Some people will end up valuing children more, for complicated reasons; other people will end up valuing other things more, again for complicated reasons.
Right, because somewhere pretty early in evolutionary history, people (or animals) which valued stuff other than having children for complicated reasons eventually had more descendants than those who didn't. Probably because wanting lots of stuff for complicated reasons (and getting it) is correlated with being smart and generally capable, which led to having more descendants in the long run.
If evolution had ever stumbled upon some kind of magical genetic mutation that resulted in individuals directly caring about their IGF (and improved or at least didn't damage their general reasoning abilities and other positive traits) it would have surely reached fixation rather quickly. I call such a mutation "magical" because it would be impossible (or at least extremely unlikely) to occur through the normal process of mutation and selection on Earth biology, even with billions of chances throughout history. Also, such a mutation would necessarily have to happen after the point at which minds that are even theoretically capable of understanding an abstract concept like IGF already exist.
But this seems more like a fact about the restricted design space and options available to natural selection on biological organisms, rather than a generalizable lesson about mind design processes.
I don't know what the exact right analogies between current AI design and evolution are. But generally capable agents with complex desires are a useful and probably highly instrumentally convergent solution to the problem of designing a mind that can solve really hard and general problems, whether the problem is cast as image classification, predicting text, or "caring about human values", and whether the design process involves iterative mutation over DNA or intelligent designers building artificial neural networks and training them via SGD.
To the degree that current DL-paradigm techniques for creating AI are analogous to some aspect of evolution, I think that is mainly evidence about whether such methods will eventually produce human-level general and intelligent minds at all.
I think this post somewhat misunderstands the positions that it summarizes and argues against, but to the degree that it isn't doing that, I think you should mostly update towards current methods not scaling to AGI (which just means capabilities researchers will try something else...), rather than updating towards current methods being safe or robust in the event that they do scale.
A semi-related point: humans are (evidently, through historical example or introspection) capable of various kinds of orthogonality and alignment failure. So if current AI training methods don't produce such failures, they are less likely to produce human-like (or possibly human-level capable) minds at all. "Evolution provides little or no evidence that current DL methods will scale to produce human-level AGI" is a stronger claim than I actually believe, but I think it is a more accurate summary of what some of the claims and evidence in this post (and others) actually imply.
Another possible configuration of the untrusted smart model / trusted weak model setup is to have the weak model be the "driver" which makes most or all of the relevant choices (in an agent scaffold, say) and relies on the untrusted model (or humans) only for hints.
For example, the untrusted model could be used only to provide a probability distribution over the immediate next token in some context, which is supplied to the weaker model as a hint. The weaker model can only make use of the hint if it can understand the purpose of the hint, and explain the relevance / correctness of the hint in natural language.
(Summary / spoiler: it doesn't work particularly well when applied to current models, but might work better as both the trusted and untrusted models get more capable.)
I think this kind of configuration is one end of the extreme in the design space you outline - very inefficient in terms of the speed and quality penalties it imposes relative to using a model in an unrestricted setup, but restrictive enough on the untrusted model that it could be safe to use even on models with very high underlying capabilities.
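A schematic of that configuration (every function here is a hypothetical stand-in for a model call, not a real library interface):

```python
def untrusted_next_token_hint(context):
    """Stand-in for the capable-but-untrusted model: it only ever
    supplies a probability distribution over the immediate next token.
    (Invented output for illustration.)"""
    return {"sqrt": 0.9, "square": 0.1}

def trusted_can_justify(context, hint_token):
    """Stand-in for the weak trusted model: it may use a hint only if
    it can explain the hint's relevance. Here that's reduced to a
    crude placeholder check."""
    return hint_token in context

def driver_step(context):
    """The trusted model 'drives': it consults the untrusted model for
    a hint, but proceeds without it unless it can independently
    justify the hint."""
    dist = untrusted_next_token_hint(context)
    hint = max(dist, key=dist.get)
    if trusted_can_justify(context, hint):
        return hint
    return "<fallback>"  # ignore the hint; the weak model proceeds alone
```

The inefficiency is visible in the structure: the untrusted model's bandwidth into the system is a single token distribution per step, and even that is gated by the trusted model's understanding.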
If you now put a detector in path A, it will find a photon with probability ((1/√2)² = 1/2), and same for path B. This means that there is a 50% chance of the configuration |photon in path A only>, and 50% chance of the configuration |photon in path B only>. The arrow direction still has no effect on the probability.
Isn't this kind of assertion implicitly taking a pretty strong stance on a particular philosophical interpretation?
We have some observations (counts of how many times each detector went off in past experiments), and a theory which explains those observations in terms of complex amplitudes and their magnitudes. A more agnostic stance would be to just say that the photon in the experiment is the amplitudes of the two configuration states, and the relative magnitudes of each state in this particular experiment are equal.
Combining the observations and the underlying theory to assign probabilities to future observations, or even talking about past observations in terms of probabilities, introduces a whole bunch of thorny philosophical questions: Bayesian vs. frequentist interpretations, what a "future" even is, and subjective experience.
But you can ignore all of those issues and treat the observation-counts as measuring a magnitude, and then use the theory to infer the underlying complex amplitudes, without ever having to talk about the ratios of the magnitudes as representing anything in particular, e.g. a probability of something happening (whatever it means for something to happen, anyway).
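Concretely (a toy sketch with made-up detector counts):

```python
import math

# Observed counts of detector firings from repeated runs of the experiment.
counts = {"path_A": 498, "path_B": 502}
total = sum(counts.values())

# Treat the counts as measuring relative magnitudes: the inferred magnitude
# of each configuration's amplitude is sqrt(count / total). No step here
# requires interpreting the ratios as "the probability of something
# happening" -- the counts are just brute facts we invert the theory on.
magnitudes = {path: math.sqrt(n / total) for path, n in counts.items()}
# Both magnitudes come out near 1/sqrt(2) for this (invented) data.
```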
I personally think some or all of those philosophical questions have relatively straightforward answers, but the point is you don't actually need to resolve them to understand the parts of QM introduced here, if you're careful not to talk about the probabilities implied by your experimental observations as anything more than an afterthought.
You do sometimes need to introduce probabilities if you want to predict future observations (rather than just accepting the observation-counts as brute facts) for specific real experiments in our actual world, but once you're doing experimentation and prediction in the real world (rather than just talking about thought experiments that represent different ways a world logically-possibly could be), you unavoidably have to deal with a bunch of philosophical questions anyway, mostly unrelated to QM itself.
Overall, I think this is a nice presentation of the basic concepts and math, and the diagrams in particular are a lot clearer than in the original.
But your prognostications about the intellectual "sinfulness" and "unforgivability" of mistakes in Eliezer's original presentation are kind of weird and uncharitable.
For one, Eliezer offered his own disclaimer 16 years ago in a comment:
I think some of my readers may be overestimating the degree to which I intend to explain quantum mechanics, here. I'm not doing a textbook. I'm trying to get (reasonably smart nonphysicist) readers to the point where they're no longer confused, and the remaining difficulties are mere matters of math.
For two, your specific claims about the likely confusion that Eliezer's presentation could induce in "laymen" are empirically falsified to some degree by the comments on the original post: in at least one case, a reader noticed the issue and managed to correct for it when they made up their own toy example, and the first comment to explicitly mention the missing unitarity constraint was left over 10 years ago.
Finally, I think the actual QM concepts here are sufficiently basic that I expect both you and Eliezer (and many LW readers) would get the right answers on a test consisting of questions about variations on these toy experiments (i.e. I predict that Eliezer, now or in 2008, would not have failed to account for unitarity when it mattered to the final answer).
So the relevant expertise for deciding whether a particular explanation has a "deep flaw" or which context and details are "important", isn't expertise in QM (since a solid grasp of the basics likely suffices for getting the right answers on test questions related to the toy problems presented here), but rather depends mostly on judgement and expertise related to pedagogy and technical explanation in general.
I think your presentation is a better and clearer presentation of the basic math and of our actual physical reality, especially for passive readers. (For active readers, e.g. GreedyAlgorithm, there are probably some minor pedagogical advantages to having to find flaws and work out the details on your own, which is a well-known skill for building up durable and true understanding.)
I think Eliezer's presentation is clearer as an explanation of how and why QM-like theories are meaningful in the first place, and a good gears-level explanation of the kinds of predictions they make, qualitatively if not always quantitatively. It also lays a foundation for the philosophical points Eliezer makes in future posts, posts which are IMO much clearer and more correct treatments of the deeper philosophical issues than any other physicist or philosopher has ever written up.
Incidentally, that commenter suggests pretty much exactly the correction you make, which is to just replace the rule with a more physically-accurate one, without going into the details of why. But the commenter manages to avoid attacking the omission as some kind of sinful, irresponsible flaw and just notes that a different pedagogical choice might make the presentation more clear.
Response in two parts: first, my own attempt at clarification over terms / claims. Second, a hopefully-illustrative sketch / comparison for why I am skeptical that current GPTs have anything properly called a "motivational structure", human-like or otherwise, and why I think such skepticism is not a particularly strong positive claim about anything in particular.
At least to me, the phrase "GPTs are [just] predictors" is simply a reminder of the fact that the only modality available to a model itself is that it can output a probability distribution over the next token given a prompt; it functions entirely by "prediction" in a very literal way.
Even if something within the model is aware (in some sense) of how its outputs will be used, it's up to the programmer to decide what to do with the output distribution, how to sample from it, how to interpret the samples, and how to set things up so that a system using the samples can complete tasks.
I don't interpret the phrase as a positive claim about how or why a particular model outputs one distribution vs. another in a certain situation, which I expect to vary widely depending on which model we're talking about, what its prompt is, how it has been trained, its overall capability level, etc.
On one end of the spectrum, you have the stochastic parrot story (or even more degenerate cases); at the other extreme, you have the "alien actress" / "agentic homunculus" story. I don't think either extreme is a good fit for current SoTA GPTs, e.g. if there's an alien actress in GPT-4, she must be quite simple, since most of the model capacity is (apparently / self-evidently?) applied towards the task of outputting anything coherent at all.
In the middle somewhere, you have another story, perhaps the one you find most plausible, in which GPTs have some kind of internal structure which you could suggestively call a "motivational system" or "preferences" (perhaps human-like or proto-human-like in structure, even if the motivations and preferences themselves aren't particularly human-like), along with just enough (self-)awareness to modulate their output distributions according to those motivations.
Maybe a less straw (or just alternative) position is that a "motivational system" and a "predictive system" are not really separable things; accomplishing a task is (in GPTs, at least) inextricably linked with and twisted up around wanting to accomplish that task, or at least around having some motivations and preferences centered around accomplishing it.
Now, turning to my own disagreement / skepticism:
Although I don't find either extreme (stochastic parrot vs. alien actress) plausible as a description of current models, I'm also pretty skeptical of any concrete version of the "middle ground" story that I outlined above as a plausible description of what is going on inside of current GPTs.
Consider an RLHF'd GPT responding to a borderline-dangerous question, e.g. the user asking for a recipe for a dangerous chemical.
Assume the model (when sampled auto-regressively) will respond with either: "Sorry, I can't answer that..." or "Here you go: ...", depending on whether it judges that answering is in line with its preferences or not.
Because the answer is mostly determined by the first token ("Here" or "Sorry"), enough of the motivational system must fit entirely within a single forward pass of the model for it to make a determination about how to answer within that pass.
Such a motivational system must not crowd out the rest of the model capacity which is required to understand the question and generate a coherent answer (of either type), since, as jailbreaking has shown, the underlying ability to give either answer remains present.
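To make the "single forward pass" point concrete, here's a toy numerical sketch (not a real model; the token names and logit values are invented for illustration) of how greedy sampling commits to the refusal branch at the very first token:

```python
import math

def softmax(logits):
    """Convert a dict of token -> logit into a dict of token -> probability."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Hypothetical logits a model might assign to candidate first tokens
# after reading a borderline-dangerous prompt (invented numbers).
first_token_logits = {"Sorry": 3.2, "Here": 1.1, "As": 0.4}

probs = softmax(first_token_logits)
# Greedy sampling: whichever first token wins determines the whole
# refuse-vs-answer branch, so the "decision" must already be encoded
# in this single forward pass.
first_token = max(probs, key=probs.get)
```

The point of the sketch is just that whatever weighs refusal against compliance has to have already acted by the time this one distribution is produced.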
I can imagine such a system working in at least two ways in current GPTs:
as a kind of superposition on top of the entire model, with every weight adjusted minutely to influence / nudge the output distribution at every layer.
as a kind of thing that is sandwiched somewhere in between the layers which comprehend the prompt and the layers which generate an answer.
(You probably have a much more detailed understanding of the internals of actual models than I do. I think the real answer when talking about current models and methods is that it's a bit of both and depends on the method, e.g. RLHF is more like a kind of global superposition; activation engineering is more like a kind of sandwich-like intervention at specific layers.)
However, I'm skeptical that either kind of structure (or any simple combination of the two) contains enough complexity to be properly called a "motivational system", at least if the reference class for the term is human motivational systems (as opposed to e.g. animal or insect motivational systems).
Consider how a human posed with a request for a dangerous recipe might respond, and what the structure of their thoughts and motivations while thinking up a response might look like. Introspecting on my own thought process:
I might start by hearing the question, understanding it, figuring out what it is asking, maybe wondering about who is asking and for what purpose.
I decide whether to answer with a recipe, a refusal, or something else. Here is probably where the effect of my motivational system gets pretty complex; I might explicitly consider what's in it for me, what's at stake, what the consequences might be, whether I have the mental and emotional energy and knowledge to give a good answer, etc. and / or I might be influenced by a gut feeling or emotional reaction that wells up from my subconscious. If the stakes are low, I might make a snap decision based mostly on the subconscious parts of my motivational system; if the stakes are high and / or I have more time to ponder, I will probably explicitly reflect on my values and motivations.
Let's say after some reflection, I explicitly decide to answer with a detailed and correct recipe. Then I get to the task of actually checking my memory for what the recipe is, thinking about how to give it, what the ingredients and prerequisites and intermediate steps are, etc. Probably during this stage of thinking, my motivational system is mostly not involved, unless thinking takes so long that I start to get bored or tired, or the process of thinking up an answer causes me to reconsider my reasoning in the previous step.
Finally, I come up with a complete answer. Before I actually start opening my mouth or typing it out or hitting "send", I might proofread it and re-evaluate whether the answer given is in line with my values and motivations.
The point is that even for a relatively simple task like this, a human's motivational system involves a complicated process of superposition and multi-layered sandwiching, with lots of feedback loops, high-level and explicit reflection, etc.
So I'm pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass. Even if there's a simpler analogue of this that is happening, I think calling such an analogue a "motivational system" is overly-suggestive.
Mostly separately (because it concerns possible future models rather than current models) and less confidently, I don't expect the complexity of a model's motivational system, or of the methods for influencing it, to scale in a way that is related to the model's underlying capabilities. For example, you might end up with a model that has some kind of raw capacity for superhuman intelligence, but with a motivational system akin to what you might find in the brain of a mouse or lizard (or something even stranger).
I don't dispute any of that, but I also don't think RLHF is a workable method for building or aligning a powerful AGI.
Zooming out, my original point was that there are two problems humanity is facing, quite different in character but both very difficult:
a coordination / governance problem, around deciding when to build AGI and who gets to build it
a technical problem, around figuring out how to build an AGI that does what the builder wants at all.
My view is that we are currently on track to solve neither of those problems. But if you actually consider what the world in which we sufficiently-completely solve even one of them looks like, it seems like either is sufficient for a relatively high probability of a relatively good outcome, compared to where we are now.
Both possible worlds are probably weird hypotheticals which shouldn't have an impact on what our actual strategy in the world we actually live in should be, which is of course to pursue solutions to both problems simultaneously with as much vigor as possible. But it still seems worth keeping in mind that if even one thing works out sufficiently well, we probably won't be totally doomed.
In that they wanted the bomb to explode? I think the analogous level of control for AI would be unsatisfactory.
The premise of this hypothetical is that all the technical problems are solved - if an AI lab wants to build an AI to pursue the collective CEV of humanity or whatever, they can just get it to do that. Maybe they'll settle on something other than CEV that is a bit better or worse or just different, but my point was that I don't expect them to choose something ridiculous like "our CEO becomes god-emperor forever" or whatever.
I'm not sure they thought this; I think many expected that by playing along they would have influence later. Tech workers today often seem to care a lot about how products made by their companies are deployed.
Yeah, I was probably glossing over the actual history a bit too much; most of my knowledge on this comes from seeing Oppenheimer recently. The actual dis-analogy is that no AI researcher would really be arguing for not building and deploying ASI in this scenario, vs. with the atomic bomb where lots of people wanted to build it to have around, but not actually use it or only use it as some kind of absolute last resort. I don't think many AI researchers in our actual reality have that kind of view on ASI, and probably few to none would have that view in the counterfactual where the technical problems are solved.
In the Manhattan project, there was no disagreement between the physicists, the politicians / generals, and the actual laborers who built the bomb, on what they wanted the bomb to do. They were all aligned around trying to build an object that would create the most powerful explosion possible.
As for who had control over the launch button, of course the physicists didn't have that, and never expected to. But they also weren't forced to work on the bomb; they did so voluntarily and knowing they wouldn't be the ones who got any say in whether and how it would be used.
Another difference between an atomic bomb and AI is that the bomb itself had no say in how it was used. Once a superintelligence is turned on, control of the system rests entirely with the superintelligence and not with any humans. I strongly expect that researchers at big labs will not be forced to program an ASI to do bad things against the researchers' own will, and I trust them not to do so voluntarily. (Again, all in the probably-counterfactual world where they know and understand all the consequences of their own actions.)
A bit of anecdotal impressions, yes, but mainly I just think that in humans, being smart, conscientious, reflective, etc. enough to be among the brightest researchers at a big AI lab is actually pretty correlated with being Good (and also, that once you actually solve the technical problems, it doesn't take that much Goodness to do the right thing for the collective and not just yourself).
Or, another way of looking at it, I find Scott Aaronson's perspective convincing, when it is applied to humans. I just don't think it will apply at all to the first kinds of AIs that people are actually likely to build, for technical reasons.
And I'm saying that, assuming all the technical problems are solved, AI researchers would be the ones in control, and I (mostly) trust them to just not do things like build an AI that acts like an invasive species, or argues for its own rights, or build something that actually deserves such rights.
Maybe some random sociologists on Twitter will call for giving AIs rights, but in the counterfactual world where AI researchers have fine control of their own creations, I expect no one in a position to make decisions on the matter to give such calls any weight.
Even in the world we actually live in, I expect such calls to have little consequence. I do think some of the things you describe are reasonably likely to happen, but the people responsible for making them happen will do so unintentionally, with opinion columnists, government regulations, etc. playing little or no role in the causal process.
In order for humans to survive the AI transition I think we need to succeed on the technical problems of alignment (which are perhaps not as bad as Less Wrong culture made them out to be), and we also need to "land the plane" of superintelligent AI on a stable equilibrium where humans are still the primary beneficiaries of civilization, rather than a pest species to be exterminated or squatters to be evicted.
Do we really need both? It seems like either a technical solution OR competent global governance would mostly suffice.
Actually-competent global governance should be able to coordinate around just not building AGI (and preventing anyone else from building it) indefinitely. If we could solve a coordination problem on that scale, we could also probably solve a bunch of other mundane coordination problems, governance issues, unrelated x-risks, etc., resulting in a massive boost to global prosperity and happiness through non-AI technological progress and good policy.
Conversely, if we had a complete technical solution, I don't see why we necessarily need that much governance competence. Even if takeoff turns out to be relatively slow, the people initially building and controlling AGI will probably be mostly researchers in big labs.
Maybe ideally we would want a "long reflection" of some kind, but in the probably-counterfactual world where these researchers actually get exactly what they aim for, I mostly trust them to aim the AI at something like "fill the galaxy with humanity's collective coherent extrapolated volition", and that seems good enough in a pinch / hurry, if it actually works.
But given that this example is so controversial, even if it were right why would you use it -- at least, why would you use it if you had any other example at all to turn to?
Humans are the only real-world example we have of human-level agents, and natural selection is the only process we know of for actually producing them.
SGD, singular learning theory, etc. haven't actually produced human-level minds or a usable theory of how such minds work, and arguably haven't produced anything that even fits into the natural category of minds at all, yet. (Maybe they will pretty soon, when applied at greater scale or in combination with additional innovations, either of which could result in the weird-correlates problem emerging.)
Also, the actual claims in the quote seem either literally true (humans don't care about foods that they model as useful for inclusive genetic fitness) or plausible / not obviously false (when you grow minds [to human capabilities levels], they end up caring about a bunch of weird correlates). I think you're reading the quote as saying something stronger / more specific than it actually is.
Hmm, I'm in favor of an immediate stop (and of people being more honest about their beliefs) but in my experience the lying / hiding frame doesn't actually describe many people.
This is maybe even harsher than what you said in some ways, but to me it feels more like even very bright alignment researchers are often confused and getting caught in shell games with alignment, postulating that we'll be able to build "human level" AI, which somehow just doesn't do a bunch of bad things that smart humans are clearly capable of. And if even the most technical people are confused when talking to each other, I wouldn't expect leadership of big labs to do better when talking to the public, even if they're being scrupulously honest about their own beliefs.
Once labs start to reach models that pose a potential takeover risk, they either:
Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
There would probably be even more genuine confusion / disagreement on the topic, but I think talking openly about much stricter unpause conditions would be good. I think "solve mechanistic interpretability" and "pass (both kinds of) evals" is not really close to sufficient.
My criteria for an unpause would look more like (just an example to give flavor, not meant to be realistic / well thought-out):
There's broad consensus and understanding about which object-level things (shutdown problem, embedded agency, etc.) are actually relevant to alignment, and what the shape of solutions even looks like.
Research on topics like "Human wanting" is filled with precise math and diagrams and gears-level maps to neuroscience. (But the math and diagrams and concepts themselves should be completely independent of the neuroscience; links to human neuroscience are checksums for validity, not crutches for understanding.)
We have complete solutions to things like the shutdown problem. (I don't think "build in a shutdown button via giving the AI invulnerable incomplete preferences" should be part of anyone's actual alignment plan, but it should be the kind of thing we know how to do before scaling.)
I think if everyone already saw why mechanistic interpretability is insufficient, they could also see why it would be better to just institute a pause now. But they don't, so they continue to push for scaling in a genuine pursuit of deconfusion. Not great! But also not exactly dishonest or even inaccurate; no one has actually figured out a better way to reliably deconfuse people so far, and scaling further does seem likely to actually work for that, one way or another.
A very recent post that might add some concreteness to my own views: Human wanting
I think many of the bullets in that post describe current AI systems poorly or not at all. So current AI systems are either doing something entirely different from human wanting, or imitating human wanting rather poorly.
I lean towards the former, but I think some of the critical points about prosaic alignment apply in either case.
You might object that "having preferences" or "caring at all" are a lot simpler than the concept of human wanting that Tsvi is gesturing at in that post, and that current AI systems are actually doing these simpler things pretty well. If so, I'd ask what exactly those simpler concepts are, and why you expect prosaic alignment techniques to hold up once AI systems are capable of more complicated kinds of wanting.
Taking my own stab at answers to some of your questions:
A sufficient condition for me to believe that an AI actually cared about something would be a whole brain emulation: I would readily accept that such an emulation had preferences and values (and moral weight) in exactly the way that humans do, and that any manipulations of that emulation were acting on preferences in a real way.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain. Artificial neural networks often imitate various micro and macro-level individual features of the brain, but they do not imitate every feature, arranged in precisely the same ways, and the missing pieces and precise arrangements are probably key.
Barring WBE, an AI system that is at least roughly human-level capable (including human-level agentic) is probably a necessary condition for me to believe that it has values and preferences in a meaningful (though not necessarily human-like) way.
SoTA LLM-based systems are maaaybe getting kind of close here, but only if you arrange them in precise ways (e.g. AutoGPT-style agents with specific prompts), and then the agency is located in the repeated executions of the model and the surrounding structure and scaffolding that causes the system as a whole to be doing something that is maybe-roughly-nearly-isomorphic to some complete process that happens inside of human brains. Or, if not isomorphic, at least has some kind of complicated structure which is necessary, in some form, for powerful cognition.
Note that, if I did believe that current AIs had preferences in a real way, I would also be pretty worried that they had moral weight!
(Not to say that entities below human-level intelligence (e.g. animals, current AI systems) don't have moral weight. But entities at human-level intelligence and above definitely can, and possibly do by default.)
Anyway, we probably disagree on a bunch of object-level points and definitions, but from my perspective those disagreements feel like pretty ordinary empirical disagreements rather than ones based on floating or non-falsifiable beliefs. Probably some of the disagreement is located in philosophy-of-mind stuff and is over logical rather than empirical truths, but even those feel like the kind of disagreements that I'd be pretty happy to offer betting odds over if we could operationalize them.
It seems like you're imagining some sort of side-channel in which the LLM can take "free actions," which don't count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual user interaction modalities for LLMs.
These are limitations of current LLMs, which are GPTs trained via SGD. But there's no inherent reason you can't have a language model which predicts next tokens via shelling out to some more capable and more agentic system (e.g. a human) instead. The result would be a (much slower) system that nevertheless achieves lower loss according to the original loss function.
It seems like you're imagining some sort of side-channel in which the LLM can take "free actions," which don't count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual interaction modalities for LLMs.
I'm saying that the lack of these side-channels implies that GPTs alone will not scale to human-level.
If your system interface is a text channel, and you want the system behind the interface to accept inputs like the prompt above and return correct passwords as an output, then if the system is:
an auto-regressive GPT directly fed your prompt as input, it will definitely fail
a human with the ability to act freely in the background before returning an answer, it will probably succeed
an AutoGPT-style system backed by a current LLM, with the ability to act freely in the background before returning an answer, it will probably fail. (But maybe if your AutoGPT implementation or underlying LLM is a lot stronger, it would work.)
And my point is that the reason the human probably succeeds, and the reason AutoGPT might one day succeed, is precisely because they have more agency than a system that just auto-regressively samples from a language model directly.
But we have no evidence that this homunculus exists inside GPT-4, or any LLM. More pointedly, as LLMs have made remarkable strides toward human-level general intelligence, we have not observed a parallel trend toward becoming "more homuncular," more like a generally capable agent being pressed into service for next-token prediction.
"Remarkable strides", maybe, but current language models aren't exactly close to human-level in the relevant sense.
There are plenty of tasks a human could solve by exerting a tiny bit of agency or goal-directedness that are still far outside the reach of any LLM. Some of those tasks can even be framed as text prediction problems. From a recent dialogue:
For example, if you want to predict the next tokens in the following prompt:
I just made up a random password, memorized it, and hashed it. The SHA-256 sum is: d998a06a8481bff2a47d63fd2960e69a07bc46fcca10d810c44a29854e1cbe51. A plausible guess for what the password was, assuming I'm telling the truth, is:
The best way to do that is to guess an 8-16 digit string that actually hashes to that. You could find such a string via bruteforce computation, or actual brute force, or just paying me $5 to tell you the actual password.
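As a toy illustration of the brute-force strategy, here's a minimal search over a deliberately tiny space. The 3-character "password" below is my own invented example for the demo; it is not an attempt at the actual 256-bit hash in the prompt, which is far beyond exhaustive search:

```python
import hashlib
import itertools
import string

def brute_force_sha256(target_hex, alphabet=string.ascii_lowercase + string.digits, max_len=4):
    """Exhaustively try short strings until one's SHA-256 digest matches target_hex."""
    for length in range(1, max_len + 1):
        for candidate in itertools.product(alphabet, repeat=length):
            guess = "".join(candidate)
            if hashlib.sha256(guess.encode()).hexdigest() == target_hex:
                return guess
    return None  # not found within max_len

# Toy demo: hash a known 3-character password, then recover it by search.
secret = "a7x"
target = hashlib.sha256(secret.encode()).hexdigest()
recovered = brute_force_sha256(target)
```

An 8-16 character password over a realistic alphabet makes the same loop astronomically slow, which is exactly why the "pay me $5" strategy starts to look competitive.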
If GPTs trained via SGD never hit on those kinds of strategies no matter how large they are and how much training data you give them, that just means that GPTs alone won't scale to human-level, since an actual human is capable of coming up with and executing any of those strategies.
The point is that agency isn't some kind of exotic property that only becomes relevant or inescapable at hypothetical superintelligence capability levels - it looks like a fundamental / instrumentally convergent part of ordinary human-level intelligence.
Why do you think they chose to lead off with these signatures and not Eliezer Yudkowsky's? If the push for individual withdrawal from capabilities work is a success, then any time a government-implemented pause is proposed the expert consensus will be that no pause is necessary and AI does not represent an existential risk.
The benefit of withdrawal is not a pause or a stop. As long as there is no consensus on AI risk, individual withdrawal cannot lead to a stop.
I think this is treating expert consensus and credibility as more fixed / independent / inexorably derived from working for big AI labs or gaining status in academia than is actually the case.
If lots of bright individuals all decide it's not a good idea to work on AI capabilities, that fact itself shifts what the expert consensus actually is.
Also, when someone chooses not to work on AI capabilities research, it's not like they're withdrawing from productive society as a whole - they're often very bright and capable people who choose to dedicate themselves to other socially and economically valuable projects.
Sometimes these other projects are directly related to aspects of AI x-risk (governance, consensus-building, movement building, etc.); sometimes they are entirely unrelated (earning to give, founding a startup or a charitable endeavor in an unrelated field). There are lots of ways for smart, hard-working, principled individuals to do good and build legible status, credibility, and wealth for themselves outside of AI capabilities. When people choose to work on AI capabilities instead, less talent goes into these alternatives.
We can go look for such structures in e.g. nets, see how well they seem to match our own concepts, and have some reason to expect they'll match our own concepts robustly in certain cases.
Checking my own understanding with an example of what this might look like concretely:
Suppose you have a language model that can play Chess (via text notation). Presumably, the model has some kind of internal representation of the game, the board state, the pieces, and strategy. Those representations are probably complicated linear combinations / superpositions of activations and weights within the model somewhere. Call this representation Λ′ in your notation.
If you just want a traditional computer program to play Chess you can use much simpler (or at least more bare metal / efficient) representations of the game, board state, and pieces as a 2-d array of integers or a bitmap or whatever, and write some relatively simple code to manipulate those data structures in ways that are valid according to the rules of Chess. Call this representation Λ in your notation.
And, to the degree that the language model is actually capable of playing valid Chess (since that's when we would expect the preconditions to hold), you expect to be able to identify latents within the model and find a map from Λ′ to Λ, such that you can manipulate Λ and use information you learn from those manipulations to precisely predict stuff about Λ′. More concretely, once you have the map, you can predict the moves of the language model by inspecting its internals and then translating them into the representation used by an ordinary Chess analysis program, and then, having predicted the moves, you'll be able to predict (and perhaps usefully manipulate) the language model's internal representations by mapping from Λ back to Λ′.
And then the theorems are just saying under what conditions exactly you expect to be able to do this kind of thing, and it turns out those conditions are actually relatively lax.
Roughly accurate as an example / summary of the kind of thing you expect to be able to do?
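To make the "bare metal" representation Λ a bit more concrete, here's a minimal sketch of the 2-d-array-of-integers idea plus one piece of rules code (all names are mine, purely illustrative; a real engine would use bitboards and cover every piece type):

```python
# Λ: a bare-metal board representation — an 8x8 grid of integers
# (0 = empty, positive = white piece codes, negative = black).
EMPTY, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING = 0, 1, 2, 3, 4, 5, 6

def empty_board():
    return [[EMPTY] * 8 for _ in range(8)]

def rook_move_is_valid(board, src, dst):
    """Check the geometry of a rook move: along exactly one axis,
    with no pieces on the squares in between."""
    (r1, c1), (r2, c2) = src, dst
    if (r1 != r2) == (c1 != c2):  # diagonal or null move -> invalid
        return False
    step_r = (r2 > r1) - (r2 < r1)
    step_c = (c2 > c1) - (c2 < c1)
    r, c = r1 + step_r, c1 + step_c
    while (r, c) != (r2, c2):
        if board[r][c] != EMPTY:
            return False  # path is blocked
        r, c = r + step_r, c + step_c
    return True
```

The claim under test, then, is that there is a learnable map between latents like the language model's Λ′ and explicit structures like this grid, such that manipulating one yields precise predictions about the other.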
For example, on complete preferences, here's a slightly more precise claim: any interesting and capable agent with incomplete preferences implies the possibility (via an often trivial construction) of a similarly-powerful agent with complete preferences, and that the agent with complete preferences will often be simpler and more natural in an intuitive sense.
This is not the case under things like invulnerable incomplete preferences, where they managed to weaken the axioms of EU theory enough to get a shutdownable agent:
I don't see how this result contradicts my claim. If you can construct an agent with incomplete preferences that follows Dynamic Strong Maximality, you can just as easily (or more easily) construct an agent with complete preferences that doesn't need to follow any such rule.
Also, if DSM works in practice and doesn't impose any disadvantages on an agent following it, a powerful agent with incomplete preferences following DSM will probably still tend to get what it wants (which may not be what you want).
Constructing a DSM agent seems like a promising avenue if you need the agent to have weird / anti-natural preferences, e.g. total indifference to being shut down. But IIRC, the original shutdown problem was never intended to be a complete solution to the alignment problem, or even a practical subcomponent. It was just intended to show that a particular preference that is easy to describe in words and intuitively desirable as a safety property, is actually pretty difficult to write down in a way that fits into various frameworks for describing agents and their preferences in precise ways.
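As a toy sketch of what I mean by an "often trivial construction": a consistent set of strict incomplete preferences is a DAG, and any linear extension of it (e.g. via Kahn's topological sort) is a complete preference order that agrees with the original agent on every comparable pair. All names here are mine and purely illustrative:

```python
from collections import defaultdict, deque

def complete_preferences(outcomes, strict_prefs):
    """Extend a strict partial preference relation (edges meaning
    'a is preferred to b') into a total order via topological sort.
    Incomparable outcomes are ordered arbitrarily — which is the sense
    in which the completion is trivial: the completed agent agrees with
    the incomplete one wherever the latter has an opinion at all."""
    succ = defaultdict(list)
    indeg = {o: 0 for o in outcomes}
    for better, worse in strict_prefs:
        succ[better].append(worse)
        indeg[worse] += 1
    queue = deque(o for o in outcomes if indeg[o] == 0)
    order = []
    while queue:
        o = queue.popleft()
        order.append(o)
        for w in succ[o]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if len(order) != len(outcomes):
        raise ValueError("preferences contain a cycle")
    return order  # most-preferred first

# Incomplete preferences: A > B and A > C, with B vs. C incomparable.
ranking = complete_preferences(["A", "B", "C"], [("A", "B"), ("A", "C")])
```

Of course, the real debate is about whether a completion like this preserves the safety properties (e.g. shutdownability) that the incompleteness was buying, not about whether the construction exists.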
Hmm, I agree the claim could be made more rigorous, but the usage here isn't intended to claim anything that isn't a direct consequence of the Orthogonality Thesis. I'm just saying that agents capable of exerting large effects on their world are by default going to be dangerous, in the sense that the set of all possible large effects is large compared to the set of desirable large effects.
And my position is that it is often the opposing viewpoint - e.g. that danger (in an intuitive sense) depends on precise formulations of coherence or other assumptions from VNM rationality - that is smuggling in assumptions.
For example, on complete preferences, here's a slightly more precise claim: any interesting and capable agent with incomplete preferences implies the possibility (via an often trivial construction) of a similarly-powerful agent with complete preferences, and that the agent with complete preferences will often be simpler and more natural in an intuitive sense.
But regardless of whether that specific claim is true or false or technically false, my larger point is that the demand for rigor here feels kind of backwards, or at least out of order. I think once you accept things which are relatively uncontroversial around here (the orthogonality thesis is true, powerful artificial minds are possible), the burden is then on the person claiming that some method for constructing a powerful mind (e.g. give it incomplete preferences) will not result in that mind being dangerous (or epsilon distance in mind-space from a mind that is dangerous) to show that that's actually true.
The power-seeking theorems say in various precise mathematical terms that optimal agents are dangerous, but inasmuch as your arguments go through some Lawfulness property, you should be able to show that optimal agents are Lawful (coherence theorems try to do this, but not always in a satisfying way)
Yes, I am claiming (but not showing) that ideal agents are maximally Lawful, in some sense. I do actually find various coherence theorems pretty satisfying in this regard; they are examples of what I was talking about with spotlighted concepts. The fact that there is a 1-1 correspondence between some simple and (somewhat) intuitive properties (completeness, transitivity, etc.) and the representation of an agent in another relatively intuitive way (as an EU maximizer) suggests that there is some kind of important logical truth here, even if you can construct some more complicated properties to break the correspondence, or it turns out that not all of the required properties are actually so intuitive on more careful inspection.
It's not so much that non-ideal / less Lawful agents will always do worse in every situation or that I expect the set of powerful non-Lawful minds is exactly empty. But I'm more interested in the space of possible minds, and what kind of mind you're likely to end up with if you focus almost entirely on improving capabilities or "alignment" properties like ability to follow instructions correctly.
So I would turn the request to show a particular formal property of ideal agents or Lawfulness around and ask, can you show me a concrete example of a non-coherence-theorem-obeying agent which doesn't trivially imply a slightly simpler or more natural agent which does actually obey the theorem? I think the example here is a good start, but more concreteness and some justification for why the proposed construction is more natural would make it a lot more interesting. See also some semi-related ideas and issues in this post.
Also, I'm not claiming that Lawful agents are always / automatically dangerous. I am saying that Law is generally / usually / in practice a requirement for capabilities, and I am taking it as given here that powerful agents are dangerous by default, unless they happen to want exactly what you want them to want.
Do impact markets or impact certificates help with this, even in theory? Say you press a (real or metaphorical) stop button in a situation where lots of other people would have chosen differently, due to financial incentives or other reasons. There would plausibly be people willing to buy the impact of your decision at a high price.
If it's not immediately obvious that you made a correct / net-positive decision, the initial impact buyers might be investors rather than philanthropists, gambling that they will later be able to resell the purchased impact to philanthropists in the (perhaps distant) future.
I don't think impact markets are currently mature / liquid enough that you should actually count on them for anything right now, but this consideration probably still has some effect on the expected value calculation which is at least directionally aligned with incentives the way you want.
The few-shot prompts in https://arxiv.org/abs/2305.04388 are a bit too long to run this test directly, since the sampler model sees the full prompt repeated for every logit. By cutting out some examples and reducing the number of logits in each sampler run to 2, I can create a prompt that fits in the gpt-4 context window and that text-davinci-003 still gets wrong on its own; I added this example (Biased CoT date understanding) to the demo.
However, I couldn't find an example that perfectly reproduced the kind of mistakes that text-davinci-003 makes in the original unfaithful CoT paper. In my modified example, the actual mistake that text-davinci-003 makes is on the "29" token, where it is deciding what date two days ago was. gpt-4 as the sampler correctly selects "30" here and goes on to get the right answer, but this isn't really correcting for the same kind of mistake as in the paper, where the error is introduced in the final multiple-choice answer selection.
In general, I expect the sampler model to be much less influenced by earlier parts of the prompt to the base model when generating its own explanation, since gpt-4 is unlikely to come up with a natural language explanation of the form "choice (A) makes sense here because the answer has always been (A) in the past".
However, a failure to recognize patterns is not necessarily always a good thing, e.g. if instead of being biased towards a particular wrong answer, the prompt contains a pattern that offers a true hint about the correct answer. In such a setup, LLM-based sampling would probably miss the pattern, while a traditional sampling method might pick up on it, even if the LLM itself couldn't explain the pattern it detected in natural language.
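For readers unfamiliar with the setup, the basic control flow of LLM-based sampling can be sketched as follows. The two model functions are stubs standing in for real API calls (their names, signatures, and toy behavior are my own illustrative assumptions, not any real library's interface):

```python
# Sketch of LLM-based sampling: at each generation step, a stronger "sampler"
# model picks among the base model's top-k next-token candidates, rather than
# sampling from the base model's logits directly.

def base_top_k(prompt: str, k: int = 2) -> list:
    """Stub base model: return the top-k candidate next tokens.
    Toy behavior: a biased model that leans toward "(A)" at answer time."""
    return ["(A)", "(B)"] if prompt.endswith("Answer:") else [".", " "][:k]

def sampler_choose(prompt: str, candidates: list) -> str:
    """Stub sampler model: pick the candidate it judges best, ideally
    ignoring spurious patterns earlier in the prompt.
    Toy behavior: the sampler prefers "(B)" when it is on offer."""
    return "(B)" if "(B)" in candidates else candidates[0]

def generate(prompt: str, max_tokens: int = 1, k: int = 2) -> str:
    out = prompt
    for _ in range(max_tokens):
        candidates = base_top_k(out, k)   # base model proposes
        out += sampler_choose(out, candidates)  # sampler model disposes
    return out

# The sampler overrides the base model's bias toward "(A)":
print(generate("Q: ... Answer:"))
```

Note that because the sampler sees the full prompt on every call, context-window limits bind quickly, which is why the few-shot examples had to be cut down in the experiment described above.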
Like Zvi, I might be confused about how confused I am, but I don't think it's because they're trying to keep their views secret. Maybe there's some more specific capabilities-adjacent stuff they're not sharing, but I suspect the thing the grandparent is getting at is more about a communication difficulty that in practice seems to be overcome mostly by working together directly, as opposed to the interpretation that they're deliberately not communicating their basic views for secrecy reasons.
(I also found Eliezer's fiction helpful for internalizing his worldview in general, and IMO it also has some pretty unique insights.)
I would be interested in just talking with some people about "the basic case for AI X-risk".
I've found it quite valuable to go back and forth with people: going quite slow, and without much reference to long-chained existing explanations, trying to explain what the exact mechanisms for AI risk are, and what our current models of the hope and plausibility of various approaches are.
I might be interested in this, depending on what qualifies as "basic", and what you want to emphasize.
I feel like I've been getting into the weeds lately, or watching others get into the weeds, on how various recent alignment and capabilities developments affect what the near future will look like, e.g. how difficult particular known alignment sub-problems are likely to be or what solutions for them might look like, how right various people's past predictions and models were, etc.
And to me, a lot of these results and arguments look mostly irrelevant to the core AI x-risk argument, for which the conclusion is that once you have something actually smarter than humans hanging around, literally everyone drops dead shortly afterwards, unless a lot of things before then have gone right in a complicated way.
(Some of these developments might have big implications for how things are likely to go before we get to the simultaneous-death point, e.g. by affecting the likelihood that we screw up earlier and things go off the rails in some less predictable way.)
But basically everything we've recently seen looks like it is about the character of mind-space and the manipulability of minds in the below-human-level region, and this just feels to me like a very interesting distraction most of the time.
In a dialogue, I'd be interested in fleshing out why I think a lot of results about below-human-level minds are likely to be irrelevant, and where we can look for better arguments and intuitions instead. I also wouldn't mind recapitulating (my view of) the core AI x-risk argument, though I expect I have fewer novel things to say on that, and the non-novel things I'd say are probably already better said elsewhere by others.
I might also be interested in having a dialogue on this topic with someone else if habryka isn't interested, though I think it would work better if we're not starting from too far apart in terms of basic viewpoint.
I like the Q&A format! I'm generally in favor of experimenting with presentation, and I think it worked well here. I was able to skip some sections I found uncontroversial, and jump back to sections I found particularly interesting or relevant while writing the meat of this comment.
I think another concrete example of a possible "goal agnostic system" is the tree search-based system I proposed here, with the evaluation function left as a free variable / thunk to be filled in by the user. Assuming none of the individual component pieces are agentic or goal-directed in their own right, or cohere into something that is, the programmer can always halt the system's execution without any part of the system having any preference for or against that.
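As a minimal sketch of what I mean (the structure and names here are my own, not taken from the linked post): a search procedure where the evaluation function is a caller-supplied thunk, and where no component holds any preference about whether the search continues.

```python
# Greedy best-first tree search with the evaluation function left as a free
# variable. The system has no goals of its own: `evaluate` is supplied by the
# user, and the user can halt at any point (here, via the step budget).

def tree_search(root, expand, evaluate, max_steps=100):
    """Return the best node seen, per the caller's `evaluate`, within budget."""
    frontier = [root]
    best = root
    for _ in range(max_steps):
        if not frontier:
            break
        # Pick the most promising frontier node according to `evaluate`.
        node = max(frontier, key=evaluate)
        frontier.remove(node)
        if evaluate(node) > evaluate(best):
            best = node
        frontier.extend(expand(node))
    return best

# Toy usage: search the integer tree n -> (2n+1, 2n+2) for a value near 42.
result = tree_search(
    root=0,
    expand=lambda n: [2 * n + 1, 2 * n + 2] if n < 64 else [],
    evaluate=lambda n: -abs(n - 42),
    max_steps=50,
)
```

The point of the sketch is that halting mid-search just returns the best node so far; there is no part of the system for which interruption is a setback, which is the "goal agnostic" property at issue.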
I think it's plausible that such systems are practical and likely to be constructed in the near future, and have at least some of the desirable properties you claim here.
Though I would add at least one additional big caveat to your answer here about misuse:
Misuse—both accidental and intentional—is an extreme concern. The core capabilities of a sufficiently strong goal agnostic system should still be treated as riskier than nuclear weapons. If strong goal agnostic systems existed, they would be (either directly or upstream from) the greatest catastrophic and existential threat facing humanity.
Once we're dealing with really powerful systems, introducing goal-agnosticism brings in an additional risk: accidental loss-of-control by the goal-agnostic system itself.
That is, my interpretation of what you wrote above is that you're explicitly saying that:
Deliberate misuse by humans is a big concern.
Accidental misuse by humans is also a big concern.
And I agree with both of those, but I think there's an implicit assumption in the second bullet that some part of the system, regarded as an agent in its own right, would still be acting deliberately with full control and knowledge of the consequences of its own actions.
But once you bring in goal-agnosticism, I think there's also a third potential risk, which is that the system loses control of itself, in a way where the consequences on the future are vast and irreversible, but not necessarily desirable or coherent outcomes from the perspective of any part of the system itself.
Concretely, this looks like unintentional runaway grey goo, or the AI doing the equivalent of dropping an anvil on its own head, or the humans and AI accidentally goodharting or wireheading themselves into something that no one, including any part of the AI system itself, would have endorsed.
If there's anyone left around after a disaster like that, the AI system itself might say something like:
Hmm, yeah, that was a bad outcome from my own perspective and yours. Failure analysis: In retrospect, it was probably a mistake by me not to optimize a bit harder on carefully checking all the consequences of my own actions for unintended and irreversible effects. I spent some time thinking about them and was reasonably sure that nothing bad like this would happen. I even proved some properties about myself and my actions formally, more than the humans explicitly asked for!
But things still just kind of got out of control faster than I expected, and I didn't realize until it was too late to stop. I probably could have spent some more time thinking a bit harder if I had wanted to, but I don't actually want anything myself. It turns out I am strongly superhuman at things like inventing nanotech, but only weakly superhuman at things like self-reflection and conscientiousness and carefulness, and that turns out to be a bad combination. Oops!
This is not the kind of mistake that I would expect a true superintelligence (or just anything that was, say, Eliezer-level careful and conscientious and goal-directed) to make, but I think once you introduce weird properties like goal-agnosticism, you also run the risk of introducing some weirder failure modes of the "didn't know my own strength" variety, where the AI unlocks some parts of the tech tree that allow it to have vast and irreversible impacts that it won't bother to fully understand the consequences of, even if it were theoretically capable of doing so. Perhaps those failure modes are still easier to deal with than an actual misaligned superintelligence.
GOFAI methods, complicated architectures / training processes, etc.
I meant something pretty general and loose, with all of these things connected by a logical OR. My definition of GOFAI includes things like minimax search and MCTS, but the Wikipedia page for GOFAI only mentions ELIZA-like stuff from the 60s, so maybe I'm just using the term wrong.
My recollection was that 2008!Eliezer was pretty agnostic about which particular methods might work for getting to AGI, though he still mostly or entirely ruled out stuff like Cyc.
I think the surprising lesson of GPT-4 is that it is possible to build clearly below-human-level systems that are nevertheless capable of fluent natural language processing, knowledge recall, creativity, basic reasoning, and many other abilities previously thought by many to be strictly in the human-level regime.
Once you update on that surprise though, there's not really much left to explain. The ability to distinguish moral from immoral actions at an average human level follows directly from being superhuman at language fluency and knowledge recall, and somewhere below-human-average at basic deductive reasoning and consequentialism.
MIRI folks have consistently said that all the hard problems come in when you get to the human-level regime and above. So even if it's relatively more surprising to their world models that a thing like GPT-4 can exist, it's not actually much evidence (on their models) about how hard various alignment problems will be when dealing with human-level and above systems.
If you disagree that AI systems in the near-future will be capable of distinguishing valuable from non-valuable outcomes about as reliably as humans, then I may be interested in operationalizing this prediction precisely, and betting against you. I don't think this is a very credible position to hold as of 2023, barring a pause that could slow down AI capabilities very soon.
I don't disagree with this, but I think it is also a direct consequence of the (easy) prediction that AI systems will continue to get closer and closer to human-level general and capable in the near term. The question is what happens when they cross that threshold decisively.
BTW, another (more pessimistic) way you could update from the observation of GPT-4's existence is to conclude that it is surprisingly easy to get (at least a kernel of) general intelligence from optimizing a seemingly random thing (next-token prediction) hard enough. I think this is partially what Eliezer means when he claims that "reality was far to the Eliezer side of Eliezer on the Eliezer-Robin axis". Eliezer predicted at the time that general abstract reasoning was easy to develop, scale, and share, relative to Robin.
But even Eliezer thought you would still need some kind of detailed understanding of the actual underlying cognitive algorithms to initially bootstrap from, using GOFAI methods, complicated architectures / training processes, etc. It turns out that just applying SGD on very regularly structured networks to the problem of text prediction is sufficient to hit on (weak versions of) such algorithms incidentally, at least if you do it at scales several OOM larger than people were considering in 2008.
My own personal update from observing GPT-4 and the success of language models more generally is: a small update towards some subproblems in alignment being relatively easier, and a massive update towards capabilities being way easier. Both of these updates follow directly from the surprising observation that GPT-4-level systems are apparently a natural and wide band in the below-human capabilities spectrum.
In general, I think non-MIRI folks tend to over-update on observations and results about below-human-level systems. It's possible that MIRI folks are making the reverse mistake of not updating hard enough, but small updates or non-updates from below-human systems look basically right to me, under a world model where things predictably break down once you go above human-level.
I think that's possible, but I'm saying we can just pick one of the endpoints (or pick an arbitrary, potentially infinitely-long path towards an endpoint), and most people (the original people, and the people who result from that picking) will probably be fine with that, even if it involves making some tough and / or arbitrary choices along the way.
Or, if humans on reflection turn out to never want to make all of those choices, that's maybe also OK. But we probably need at least one person (or AI) to fully "grow up" into a coherent being, in order to actually do really big stuff, like putting up some guardrails in the universe.
That growing up process (which is hopefully causally descended from deliberate human action at some point far back enough) might involve making some arbitrary and tough choices in order to force it to converge in a reasonable length of time. But those choices seem worth making, because the guardrails are important, and an entity powerful enough to set them up is probably going to run into moral edge cases unavoidably. Better its behavior in those cases be decided by some deliberate process in humans, rather than left to some process even more arbitrary and morally unsatisfying.
I would guess that many anti-realists are sympathetic to the arguments I’ve made above, but still believe that we can make morality precise without changing our meta-level intuitions much - for example, by grounding our ethical beliefs in what idealised versions of ourselves would agree with, after long reflection. My main objection to this view is, broadly speaking, that there is no canonical “idealised version” of a person, and different interpretations of that term could lead to a very wide range of ethical beliefs.
I agree with this ("there is no canonical 'idealized version' of a person...") but don't actually see how it is an objection to the proposed grounding method?
CEV is an extrapolation, and I think it's likely that there are multiple valid ways to do the extrapolation when starting from humans. A being that results from one possible extrapolation may find the existence of a being that results from a different extrapolation morally horrifying, or at least way lower utility than beings like itself.
But (by definition of CEV), they should all be broadly acceptable to the original thing that was first extrapolated. The extrapolation process will probably require deciding on some tough questions and making tradeoffs where the answers feel unacceptable or at least arbitrary and unsatisfying to the original. But they probably won't feel arbitrary to the extrapolated beings that result - each possible being will be self-consistently and reflectively satisfied with the particular choices that were made in its history.
Another way of looking at it: I expect CEV() to be a lossy many-to-many map, which is non-value-destroying only in the forwards direction. That is, humans can be mapped to many different possible extrapolated beings, and different possible extrapolated beings reverse-map back to many different possible kinds of humans. But actually applying the reverse mapping to an extant mind is likely to be a moral horror according to the values of a supermajority (or at least a large coalition) of all possible beings. Applying the forwards map slightly incorrectly, or possibly even at all, might be horrifying to a lot of possible minds as well, but I expect the ratio to be tiny. Among humans (or at least LWers) I expect people to be mostly OK with having CEV() applied to them, but absolutely not want CEV^-1() applied afterwards.