I define rationality as "more in line with your overall values". There are problems here, because people do profess social values that they don't really hold (in some sense), but roughly it is what they would reflect on and come up with.
Someone could value the short-term more than the long-term, but I think that most don't. I'm unsure if this is a side-effect of Christianity-influenced morality or just a strong tendency of human thought.
Locally optimal is probably the correct framing, but it is still irrational relative to whatever idealized values the individual would have. Just like how a hacky approximation of a chess engine is irrational relative to Stockfish: they both can be roughly considered to have the same goal, but one has various heuristics and short-term thinking that hamper it. These heuristics can be essential, as it runs with less processing power, but in the human mind they can be trained and tuned.
Though I do agree that smoking isn't always irrational, I would say smoking is irrational for the supermajority of human minds. The social negativity around smoking may be what influences them primarily, but I'd consider that just another fragment of being irrational: more than 90% of them would have a value for their health, but they are varying levels of poor at weighing the costs, and the social negativity response is easier for the mind to emulate. Especially since they might see people walking around them while they're out taking a cigarette. (Of course, the social approval is some part of a real value too, though people have preferences about which social values they give in to.)
An important question here is "what is the point of being 'more real'?". Does having a higher measure give you a better acausal bargaining position? Do you terminally value more realness? Less vulnerable to catastrophes? Wanting to make sure your values are optimized harder?
I consider these, except for the terminal sense, to be rather weak as far as motivations go.
Acausal Bargaining: Imagine a bunch of nearby universes with instances of 'you'. They all have variations, some very similar, others with directions that seem a bit strange to the others. Still identifiably 'you' by a human notion of identity. Some of them became researchers, others investors, a few artists, writers, and a handful of CEOs.
You can model these as being variations on some shared utility function: each instance's utility function combines a component that is shared with a component that is the individual utility function. Some of them are more social, others cynical, and so on. A believable amount of human variation that won't necessarily converge to the same utility function on reflection (but quite close).
For a human, losing memories so that you are more real is akin to each branch chopping off its individual component. They lose memories of a wonderful party which changed their opinion of them, they no longer remember the horrors of a war, and so on.
Everyone may do the simple ask of losing all their minor memories, which have no effect on the utility function, but then if you want more bargaining power, do you continue? The hope is that this would make your coalition easier to locate, to be more visible in "logical sight". That this increased bargaining power would thus ensure that, at the least, your important shared values are optimized harder than they could be if you were a disparate group of branches.
I think this is sometimes correct, but often not.
From a simple computationalist perspective, increasing the measure of the 'overall you' is of little matter. The part that bargains, your rough algorithm and your utility function, is already shared: the shared component is common to all your instances already; some of you just have considerations that pull in other directions (the individual component).
This is the same core idea of the FDT explanation of why people should vote: because, despite not being clones of you, there is a group of people that share similar reasoning as you. Getting rid of your memories in the voting case does not help you!
For the Acausal Bargaining case, there is presumably some value in being simpler. But that more likely means you should bargain 'nearby' in order to present a computationally cheaper value function 'far away'. So, similar to forgetting, you appear as if having some shared utility function, but without actually forgetting, and are thus able to optimize for your own utility function, individual part included, in your local universe. As well, the bargained utility function presented far away (less logical sight to your cluster of universes) is unlikely to be the same as the shared component.
So, overall, my argument would be that forgetting does give you more realness. If at 7:59AM, a large chunk of universes decide to replace part of their algorithm with a specific coordinated one (like removing a memory) then that algorithm is instantiated across more universes. But from a decision-theoretic perspective, I don't think that matters too much? You already share the important decision-theoretic parts, even if the whole algorithm is not shared.
From a human perspective we may care about this as a value of wanting to 'exist more' in some sense. I think this is a reasonable enough value to have, but it is often satisfied by considering that sharing decision methods and 99.99% of personality is enough.
My main question of whether this is useful beyond a terminal value for existing more is about quantum immortality, about which I am more uncertain.
Beliefs and predictions that influence wants may be false or miscalibrated, but the feeling itself, the want itself, just is what it is, the same way sensations of hunger or heat just are what they are.
I think this may be part of the disconnect between me and the article. I often view the short jolt preferences (that you get from seeing an ice-cream shop) as heuristics, as effectively predictions paired with some simpler preference for "sweet things that make me feel all homey and nice". These heuristics can be trained to know how to weigh the costs, though I agree just having a "that's irrational" / "that's dumb" is a poor approach to it. Other preferences, like "I prefer these people to be happy" are not short-jolts but rather thought about and endorsed values that would take quite a bit more to shift—but are also significantly influenced by beliefs too.
Other values like "I enjoy this aesthetic" seem more central to your argument than short-jolts or considered values.
This is why you could view a smoker's preference for another cigarette as irrational: the 'core want' is just a simple preference for the general feel of smoking a cigarette, but the short-jolt preference has the added prediction of "and this will be good to do". But that added prediction is false and inconsistent with everything they know. The usual statement of "you would regret this in the future". Unfortunately, the short-jolt preference often has enough strength to get past the other preferences, which is why you want to downweight it.
So, I agree that there are various preferences where having them is disentangled from whether you're rational or not, but I also think most preferences are quite entangled with predictions about reality.
“inconsistent preferences” only makes sense if you presume you’re a monolithic entity, or believe your "parts" need to all be in full agreement all the time… which I think very badly misunderstands how human brains work.
I agree that humans can't manage this, but it does still make sense for a non-monolithic entity: you'd take there being an inconsistency as a sign that there's a problem, which is what people tend to do, even if it can't be fixed.
Finally, the speed at which you communicate vibing means you're communicating almost purely from System 1, expressing your actual felt beliefs. It makes deception both of yourself and others much harder. Its much more likely to reveal your true colors. This allows it to act as a values screening mechanism as well.
I'm personally skeptical of this. I've found I'm far more likely to lie than I'd endorse when vibing. Saying "sure I'd be happy to join you on X event" when it is clear with some thought that I'd end up disliking it. Or exaggerating stories because it fits with the vibe.
I view System-1 as less concerned with truth here; it is the one that is more likely to produce a fake argument in response to a suggested problem, and more likely to play social games regardless of whether they make sense.
I agree that it is easy to automatically lump the two concepts together.
I think another important part of this is that there are limited methods for most consumers to coordinate against companies to lower their prices. There's shopping elsewhere, leaving a bad review, or moral outrage. The last may have a chance of blowing up socially, such as becoming a boycott (but boycotts are often considered ineffective), or it may encourage the government to step in. In our current environment, the government often operates as the coordination method to punish companies for behaving in ways that people don't want. In a much more libertarian society we would want this replaced with other methods, so that consumers can make it harder to put themselves in a prisoner's dilemma or stag hunt against each other.
If we had common organizations for more mild coordination than the state interfering, then I believe this would improve the default mentality because there would be more options.
It has also led to many shifts in power between groups based on how well they exploit reality. From hunter-gatherers to agriculture, to grand armies spreading an empire, to ideologies changing the fates of entire countries, and to economic & nuclear super-powers making complex treaties.
This reply is perhaps a bit too long, oops.
Having a body that does things is part of your values and is easily described in them. I don't see deontology or virtue ethics as giving any more fundamentally adequate solution to this (beyond the trivial 'define a deontological rule about ...', or 'it is virtuous to do interesting things yourself', but why not just do that with consequentialism?).
My attempt at interpreting what you mean is that you're drawing a distinction between morality about world-states vs. morality about process, internal details, experiencing it, 'yourself'. To give them names, "global"-values (you just want them Done) & "indexical"/"local"-values (preferences about your experiences, what you do, etc.). Global would be reducing suffering, avoiding heat death and whatnot. Local would be that you want to learn physics from the ground up and try to figure out XYZ interesting problem as a challenge by yourself, that you would like to write a book rather than having an AI do it for you, and so on.
I would say that, yes, for Global you should/would have an amorphous blob that doesn't necessarily care about the process. That's your (possibly non-sentient) AGI designing a utopia while you run around doing interesting Local things. Yet I don't see why you think only Global is naturally described in consequentialism.
I intrinsically value having solved hard problems—or rather, I value feeling like I've solved hard problems, which is part of overall self-respect, and I also value realness to varying degrees. That I've actually done the thing, rather than taken a cocktail of exotic chemicals. We could frame this in a deontological & virtue ethics sense: I have a rule about realness, I want my experiences to be real. / I find it virtuous to solve hard problems, even if in a post-singularity world.
But do I really have a rule about realness? Uh, sort-of? I'd be fine to play a simulation where I forget about the AGI world and am in some fake-scifi game world and solve hard problems. In reality, my value has a lot more edge-cases that will be explored than many deontological rules prefer. My real value isn't really a rule, it is just sometimes easy to describe it that way. Similar to how "do not lie" or "do not kill" is usually not a true rule.
Like, we could describe my actual value here as a rule, but seems actually more alien to the human mind. My actual value for realness is some complicated function of many aspects of my life, preferences, current mood to some degree, second-order preferences, and so on. Describing that as a rule is extremely reductive.
And 'realness' is not adequately described as a complete virtue either. I don't always prefer realness: if playing a first-person shooter game, I prefer that my enemies are not experiencing realistic levels of pain! So there are intricate trade-offs here as I continue to examine my own values.
Another aspect I'm objecting to mentally when I try to apply those stances is that there's two ways of interpreting deontology & virtue ethics that I think are common on LW. You can treat them as actual philosophical alternatives to consequentialism, like following the rule "do not lie". Or you can treat them as essentially fancy words for deontology=>"strong prior for this rule being generally correct and also a good coordination point" and virtue ethics=>"acting according to a good Virtue consistently as a coordination scheme/culture modification scheme and/or because you also think that Virtue is itself a Good".
Like, there's a difference between talking about something using the language commonly associated with deontology and actually practicing deontology. I think conflating the two is unfortunate.
The overarching argument here is that consequentialism properly captures a human's values, and that you can use the basic language of "I keep my word" (deontology flavored) or "I enjoy solving hard problems because they are good to solve" (virtue ethics flavored) without actually operating within those moral theories. You would have the ability to unfold these into the consequentialist statements of whatever form you prefer.
In your reply to cubefox, "respect this person's wishes" is not a deontological rule. Well, it could be, but I expect your actual values don't fulfill that. Just because your native internal language suggestively calls it that, doesn't mean you should shoehorn it into the category of rule!
"play with this toy" still strikes me as natively a heuristic/approximation to the goal of "do things I enjoy". The interlinking parts of my brain that decided to bring that forward are good at their job, but also dumb because they don't do any higher-order thinking. I follow that heuristic only because I expect to enjoy it, with the heuristic providing that information. If I had another part of my consideration that pushed me towards considering whether that is a good plan, I might realize that I haven't actually enjoyed playing with a teddy bear in years despite still feeling nostalgia for that. I'm not sure I see the gap between consequentialism and this. I don't have the brain capacity to consider every impulse I get, but I do want to consider agents other than AIXI to be consequentialists.
I think there's a space in there for a theory of minds, but I expect it would be more mechanistic or descriptive rather than a moral theory. À la shard theory.
Or, alternatively, even if you don't buy my view that the majority of my heuristics can be cast as approximations of consequentialist propositions, then deontology/virtue ethics are not natural theories either by your descriptions. They miss a lot of complexity even within their usual remit.
I think there's two parts of the argument here:
- Issues of expressing our values in a consequentialist form
- Whether or not consequentialism is the ideal method for humans
The first I consider not a major problem. Mountain climbing is not what you can put into the slot to maximize, but you do put happiness/interest/variety/realness/etc. into that slot. This then falls back into questions of "what are our values". Consequentialism provides an easy answer here: mountain climbing is preferable along important axes to sitting inside today. This isn't always entirely clear to us, we don't always think natively in terms of consequentialism, but I disagree with:
There are many reasons to do things - not everything has to be justified by consequences.
We just don't usually think in terms of consequences, we think in terms of the emotional feeling of "going mountain climbing would be fun". This is a heuristic, but is ultimately about consequences: that we would enjoy the outcome of mountain climbing better than the alternatives immediately available to our thoughts.
This segues into the second part. Is consequentialism what we should be considering? There's been posts about this before, of whether our values are actually best represented in the consequentialist framework.
For mountain climbing, despite the heuristic of "I feel like mountain climbing today", if I learned that I would actually enjoy going running for an hour then heading back home more, then I would do that instead. When I'm playing with some project, part of that is driven by in-the-moment desires, but ultimately by a sense that this would be an enjoyable route. This is part of why I view the consequentialist lens as a natural extension of most if not all of our heuristics.
An agent that really wanted to go in circles doesn't necessarily have to stop, but for humans we do care about that.
There's certainly a possible better language/formalization to talk about agents that are mixes of consequentialist parts and non-consequentialist parts, which would be useful for describing humans, but I also am skeptical about your arguments for non-consequentialist elements of human desires.
If I value a thing at one period of life and turn away from it later, I have not discovered something about my values. My values have changed. In the case of the teenager we call this process “maturing”. Wine maturing in a barrel is not becoming what it always was, but simply becoming, according to how the winemaker conducts the process.
Your values change according to the process of reflection - the grapes mature into wine through fun chemical reactions.
From what you wrote, it feels like you are mostly considering your 'first-order values'. However, you have an updating process that you also have values about. Like that I wouldn't respect simple mind control that alters my first-order values, because my values consider mind-control as disallowed.
Similar to why I wouldn't take a very potent drug even if I know my first-order values would rank the feeling very highly, because I don't endorse that specific sort of change.
I have never eaten escamoles. If I try them, what I will discover is what they are like to eat. If I like them, did I always like them? That is an unheard-falling-trees question.
Then we should split the question. Do you have a value for escamoles specifically before eating them? No. Do you have a system of thought (of updating your values) that would ~always result in liking escamoles? Well, not in full generality. You might end up with some disease that affects your tastebuds permanently. But in some reasonably large class of normal scenarios, your values would consistently update in a way that would end up liking escamoles were you to ever eat them. (But really, the value for escamoles is more instrumental to a value for [insert escamole flavor, texture, etc. here], which the escamoles are learned to be a good instance of.)
What johnwentworth mentions would then be the question of "Would this approved process of updating my values converge to anything"; or tend to in some reasonable reference class; or at least have some guaranteed properties that aren't freely varying. I don't think he is arguing that the values are necessarily fixed and always persistent (I certainly don't always handle my values according to my professed beliefs about how I should update them), but that they're constrained. That the brain also models them as reasonably constrained, and that you can learn important properties of them.
Is there a way to get an article's raw or original content?
My goal is mostly to put articles in some area (ex: singular learning theory) into a tool like Google's NotebookLM to then ask quick questions about.
Google's own conversion of HTML to text works fine for most content, excepting math. A division such as $p(w \mid D_n) = \frac{p(D_n \mid w)\,\varphi(w)}{p(D_n)}$ may turn into p ( w | D n ) = p ( D n | w ) φ ( w ) p ( D n ), becoming incorrect.
I can always just grab the article's HTML content (or use the GraphQL api for that), but HTMLified MathJax notation is very, uh, verbose. I could probably do some massaging of the data and then an LLM to translate it back into the more typical markdown $ delimited syntax, but I'm hopeful that there's some existing method to avoid that entirely.
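To make the failure mode concrete, here is a toy Python sketch of why a plain-text pass loses the division. Everything in it is hypothetical (the `Frac` class and both render functions are made up for illustration, and this is not how NotebookLM or MathJax actually work internally); it only shows that once the layout encoding the fraction bar is dropped, numerator and denominator read as a product.

```python
from dataclasses import dataclass


@dataclass
class Frac:
    """Hypothetical fraction node: a numerator over a denominator."""
    num: str
    den: str


def to_latex(lhs: str, frac: Frac) -> str:
    # Structure-preserving rendering: the division survives as \frac{..}{..}.
    return f"${lhs} = \\frac{{{frac.num}}}{{{frac.den}}}$"


def to_flat_text(lhs: str, frac: Frac) -> str:
    # Mimics a naive HTML-to-text pass: emit the visible glyphs in order
    # and drop the layout that encoded the fraction bar.
    return f"{lhs} = {frac.num} {frac.den}"


posterior = Frac(num="p(D_n | w) φ(w)", den="p(D_n)")
latex = to_latex("p(w | D_n)", posterior)
flat = to_flat_text("p(w | D_n)", posterior)

print(latex)
print(flat)
```

The flattened string contains every symbol of the original but no marker of which part was the denominator, which is why recovering the $-delimited form afterwards needs either the source TeX or a guess.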
I'd be interested in an article looking at whether the FDA is better at regulating food safety. I do expect food is an easier area, because erring on the side of caution doesn't really lose you much: most food products have close substitutes. If there's some low but not extremely low risk of a chemical in a food being bad for you, then the FDA can more easily deny approval without significant consequences, whereas medicine has more outsized effects if you are slow to approve usage.
Yet, perhaps this has led to reduced variety in food choices? I notice fewer generic or lesser-known food and beverage brands relative to a decade ago, though I haven't verified whether that background belief is accurate. I'd also be curious for such an article to investigate the extent of the barriers to designing a new food product, especially food products that aren't doing anything new and are purely a mixture of ingredients already considered safe (or at least, considered allowed). Would there be more variety? Or notably cheaper food?
https://www.lesswrong.com/posts/zo9zKcz47JxDErFzQ/call-for-distillers
I see this as occurring with various pieces of Infrabayesianism, like Diffractor's UDT posts. They're dense enough mathematically (hitting the target), which makes them challenging to read... and then also challenging to discuss. There are fewer comments even from the people who read the entire post because they don't feel competent enough to make useful commentary (with some truth behind that feeling); the silence then makes commenting even harder. At least that's what I've noticed in myself, even though I enjoy & upvote those posts.
Less attention seems natural because of specialization into cognitive niches, not everyone has read all the details of SAEs, or knows all the mathematics referenced in certain agent foundations posts. But it does still make it a problem in socially incentivizing good research.
I don't know if there are any great solutions. More up-weighting for research-level posts? I view the distillation idea from a ~year ago as helping with drawing attention towards strong (but dense) posts, but it appeared to die down. Try to revive that more?
I draw the opposite conclusion from this: the fact that the decision theory posts seem to work on the basis of a computationalist theory of identity makes me think worse of the decision-theory posts.
Why? If I try to guess, I'd point at not often considering indexicality as a consideration, merely thinking of it as having a single utility function which simplifies coordination. (But still, a lot of decision theory doesn't need to take into account indexicality.)
I see the decision theory posts less as giving new intuitions, and more as breaking old ones that are ill-adapted, though that's partially framing/semantics.
Can you link to some of these? I do not recall seeing anything like this here.
I'll try to find some, but they're more likely to be side parts of comment chains rather than posts, which does make them more challenging to search for. I doubt they're as in-depth as we'd like, but I think there is work done there, even if I do think the assumption of QM not mattering much is likely.
The basic idea is what would it give you? If the brain uses it for a random component, why can't that be replaced with something pseudorandom? Which is fine from an angle of not seeing determinism as a problem. If the brain utilizes entangled atoms/neurons/whatever for efficiency, why can't those be replaced with another method — possibly impractically inefficient? Does the brain functionally depend on an arbitrary precision Real for a calculation, why would it, and what would be the matter if it was cut off to N digits?
- Somewhat Eliezer's Comment Here and some of the other pieces
- "Does davidad's uploading moonshot work", which has more specifics about what davidad thinks is relevant to uploading
- With this as also a good article to read as a reply
- "QM Has nothing to do with consciousness": meh
- "Scott Aaronson on Free Will": about more than just FW. He's arguing against the LW position, but I don't consider it a strong argument; see the comments for a bit of discussion.
- Quotes and Notes on Scott Aaronson's has more positive leaning commentary
There's certainly more, but finding specific comments I've read over the years is a challenge.
Everything was determined in the initial configuration of quantum waveforms in the distant past of your lightcone. The experience of time and change is just a side-effect of your embeddedness in this giant static many-dimensional universe."
I'm not sure I understand the distinction. Even if the true universe is a bunch of freeze-frame slices, time and change still functionally act the same. Given that I don't remember random nonsense in my past, there's some form of selection about which freeze-frames are constructed. Or, rather, with differing measure. Thus most of my 'future' measure is concentrated on freeze-frames that are consistent with what I've observed, as that has held true in the past.
Like, what you seem to be saying is Timeless Physics, of which I'd agree more with this statement:
An unchanging quantum mist hangs over the configuration space, not churning, not flowing. But the mist has internal structure, internal relations; and these contain time implicitly. The dynamics of physics—falling apples and rotating galaxies—is now embodied within the unchanging mist in the unchanging configuration space.
So I'd agree that computation only makes sense with some notion of time. That there has to be some way it is being stepped forward. (To me this is an argument in favor of not privileging spatial position in the common teleportation example, but we've seemed to move down a level to whether the brain can be implemented at all)
(bits about CEV) conceptually incoherent
I misworded what I said, sorry. I more meant that you consider it to say/imply nothing meaningful, but you can certainly still argue against it (such as arguing that it isn't coherent).
I think it would be non-physicalist if (to slightly modify the analogy, for illustrative purposes) you say that a computer program I run on my laptop can be identified with the Python code it implements, because it is not actually what happens.
I would say that the computer program running can be considered as an implementation of the abstract Python code. I agree that this model is missing details, such as the exact behavior of the transistor, how fast it switches, the exact positions of the atoms, etcetera. That is dependent on the mind considering it, I agree. The cosmic ray event would make it so that it is no longer an implementation of the abstract Python program. You could expand the consideration to include more of the universe. Just as you could expand your model to consider the computer program as an implementation of the Python program with some constraints: that if this specific transistor gets flipped one too many times it will fry, that there's a slight possibility of a race condition that we didn't consider at all in our abstract implementation, there's a limit to the speed and heat it can operate at, a cosmic ray could come from these areas of space and hit it with 0.0x% probability thus disrupting functionality...
It still seems quite reasonable to say it is an implementation of the python program. I'm open to the argument that there isn't a completely natural privileged point of consideration from which the computer is implementing the same pattern as another computer, and that the pattern is this python program. But as I said before, even if this is ultimately some amount of purely subjective, it still seems to capture quite a lot of the possible ideas?
Like in mathematics, I can have an abstract implementation of a sorting algorithm and prove that a python program for a more complicated algorithm (bubblesort, whatever) is equivalent. This is missing a lot of details, but that same sort of move is what I'm gesturing at.
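A minimal Python sketch of that move, under stated assumptions: `is_sorted_version` stands in for the abstract specification of sorting, `bubble_sort` for the concrete program, and `check_up_to` is a hypothetical helper name. Note that this is a bounded exhaustive check over small inputs, not a proof; a real equivalence proof would use an invariant argument or a proof assistant.

```python
from collections import Counter
from itertools import product


def is_sorted_version(output, original):
    """Abstract spec: output is non-decreasing and a permutation of original."""
    ordered = all(a <= b for a, b in zip(output, output[1:]))
    return ordered and Counter(output) == Counter(original)


def bubble_sort(items):
    """Concrete program: repeatedly swap adjacent out-of-order elements."""
    xs = list(items)
    n = len(xs)
    for i in range(n):
        # After pass i, the last i + 1 positions hold their final values.
        for j in range(n - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs


def check_up_to(max_len, alphabet=(0, 1, 2)):
    """Check the program against the spec on every input up to max_len."""
    for k in range(max_len + 1):
        for candidate in product(alphabet, repeat=k):
            if not is_sorted_version(bubble_sort(candidate), candidate):
                return False
    return True


all_ok = check_up_to(6)
print(all_ok)
```

The spec says nothing about swaps, passes, or running time; those details belong only to the concrete program, which is the sense in which the abstract description "misses a lot" while still being implemented by it.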
It is merely part of a mathematical model that, as I've described in response to Ruby earlier, represents a very lossy compression of the underlying physical substrate
I can understand why you think that just the neurons / connections is too lossy, but I'm very skeptical of the idea that we'd need all of the amplitudes related to the brain/mind. A priori that seems unlikely, what with how little fundamentally turns on the specifics of QM, and the parts that do can all be implemented specially, as I discussed somewhat above.
(That also reminds me of another reason why people sometimes just mention neurons/connections, which I forgot in my first reply: because they assume you've gotten the basic brain architecture that is shared and just need to plug in the components that vary)
I disagree that this distinction between our model and reality has been lost; rather, it has been deemed not too significant, or treated as something you'd study in-depth when actually performing brain uploads.
What is "the computation"? Can we try to taboo that word?
As I said in my previous comment, and earlier in this one, I'm open to the idea of computation being subjective instead of a purely natural concept. Though I'd expect that there's not that many free variables in pinning down the meaning. As for tabooing, I think that is kind of hard, as one very simple way of viewing computation is "doing things according to rules".
You have an expression, a multiplication of two numbers. This is in your mind and relies on subjective interpretations of what the symbols mean. You implement that abstract program (that abstract doing-things, a chain of rules of inference, a way that things interact) into a computer. The transistors were utilized because they matched the conceptual idea of how switches should function, but they have more complexities than the abstract switch, which introduces design constraints throughout the entire chip. The chip's ALU implements this through a bunch of transistors, which are more fundamentally made up of silicon in specific ways that regulate how electricity moves. There are layers and layers of complexities even as it processes the specific binary representations of the two numbers and shifts them in the right way. But, despite all this, all that fundamental behavior, all the quantum effects like tunneling which restrict size and positioning, it is computing the answer. You see the result and are pretty confident that no differences between your simple model of the computer and reality occurred.
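As a rough illustration of the "binary representations and shifts" layer, here is a Python sketch of shift-and-add multiplication checked against the abstract operation. The function name is hypothetical and the operands are arbitrary; real ALUs use more elaborate multiplier circuits, so treat this as one simple way the abstract rule can be carried out, not as a description of any actual chip.

```python
def shift_and_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts, bit tests, and adds."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")
    result = 0
    while b:
        if b & 1:       # lowest bit of b is set: this shifted copy of a counts
            result += a
        a <<= 1         # next bit of b is worth twice as much
        b >>= 1         # move on to the next bit of b
    return result


# Compare the bit-level procedure with the abstract operation on a small grid.
agrees = all(
    shift_and_add_multiply(x, y) == x * y
    for x in range(64)
    for y in range(64)
)
print(shift_and_add_multiply(6, 7), agrees)
```

Nothing in the loop mentions "multiplication"; that reading comes from the abstract model we hold, and the check is what justifies saying the bit-level procedure implements it on the tested range.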
This is where I think arguments about subjectivity of computation can be made. Introduce a person who is talking about a different abstract concept: they encode it as binary because that's what you do, and they have an operation that looks like multiplication and produces the same answer for that binary encoding. Then the interpretation of that final binary output is dependent on the mind, because the mind has a different idea of what they're computing (the abstract idea is different, even if those parts match up). But I think a lot of those cases are non-natural, which is part of why I think that even if computation doesn't make sense as a fundamental thing or a completely natural concept, it still covers a wide area of concern and is a useful tool. Similar to how the distinction of values and beliefs is a useful tool even when strictly discussing humans, but even more so. So then, the two calculators are implementing the same abstract algorithm in their silicon, and we fall back to two questions: 1) is the mind within the edge-cases such that it is not entirely meaningful to talk about an abstract program that it is implementing? 2) okay, but even if they share the same computation, what does that imply?
I think there could and should be more discussion of the complications around computation, with the easy to confuse interaction between levels of 'completely abstract idea' (platonism?), 'abstract idea represented in the mind' (what I'm talking about with abstract; subjective), 'the physical way that all the parts of this structure behave' (excessive detail but as accurate as possible; objective), 'the way these rules do a specific abstract idea' (chosen because of abstract ideas like a transistor is chosen because it functions like a switch, and the computer program is compiled in such a way because it matches the textual code you wrote which matches the abstract idea in your own mind; objective in that it is behaving in such a way, possibly subjective interpretation of the implications of that behavior).
We could also view computation through the lens of Turing machines, but then that raises the argument of "what about all these quantum shenanigans, those are not computable by a Turing machine". I'd say that finite approximations get you almost all of what you want. Then there's the objection of "Turing machines aren't available as a fundamental thing", which is true, and "Turing machines assume a privileged encoding", which is part of what I was trying to discuss above.
(I got kinda rambly in this last section, hopefully I haven't left any facets of the conversation with a branch I forgot to jump back to in order to complete)
the lack of argumentation or discussion of this particular assumption throughout the history of the site means it's highly questionable to say that assuming it is "reasonable enough"
While discussion on personal identity has mostly not received a single overarching post focusing solely on arguing all the details, it has been discussed to varying degrees across possible points of contention. There is Thou Art Physics, which focuses on getting the idea that you are made up of physics into your head; Identity Isn't in Specific Atoms, which tries to dissolve the common intuition that the specific basic atoms matter; and Timeless Identity, which is a culmination of various elements of those posts into the idea that even if you duplicate a person they both are still 'you'. There is also more, some of which you've linked, but I consider it strange to say that there's a lack of discussion. The sequence that the posts I've linked are a part of has other discussions, though I agree that they are often from the position of arguing against a baseline of dualism, but I believe they have many points that are relevant to an argument for computationalism. I think there is a lack of discussion about the very specific points you have a tendency to raise, but as I'll discuss, I find myself confused about their relevancy to varying degrees.
There's also the facet of decision theory posting that LW enjoys, which encourages this class of view. Decision problems like Newcomb's Paradox or Parfit's Hitchhiker emphasize the idea that "you can be instantiated inside a simulation to predict your actions, and you should act as though you, roughly, control their actions because of the similarity of your computational implementations". Of course, this works even without assuming the simulations are conscious, but I do think it has led to clearer consideration because it helps break past people's intuitions. Those intuitions are not made for the scenarios that we face, or will potentially have to face.
Bensinger yet again replied in a manner that seemed to indicate he thought he was arguing against a dualist who thought there was a little ghost inside the machine, an invisible homunculus that violated physicalism
Because most often the people suggesting such are dualists, or have a lot of the similar ideas even if they are discussed in an "I am uncertain" manner. I agree Rob could've given a better reply, but it was a reasonable assumption. (I personally found Andesolde's argument confused, with the later parts having a focus on first-person subjective experience that I think is not really useful to consider. There are uncertainties in there, but besides the idea that the mind could be importantly quantum in some way, they didn't seem that relevant.)
That's perfectly fine, but "souls don't exist and thus consciousness and identity must function on top of a physical substrate" is very different from "the identity of a being is given by the abstract classical computation performed by a particular (and reified) subset of the brain's electronic circuit," and the latter has never been given compelling explanations or evidence.
I agree it hasn't been argued in depth, but there have definitely been arguments about the extent to which QM affects the brain. The usual conclusion was that the effect is minor, and/or that we had no evidence for believing it necessary. I would need a decently strong argument that QM is in some way computationally essential.
the entire brain structure in favor of (a slightly augmented version of) its connectome, and the entire chemical make-up of it in favor of its electrical connections.
More than just the electrical signals matter, this is understood by most. There's plenty of uncertainty about the level of detail needed to simulate/model the brain. Computationalism doesn't imply that only the electrical signals matter, it implies that whatever makes up the computation matters, which can be done via tiny molecules & electrons, water pipes, or circuitry. Simplifying a full molecular simulation to the functional implications of it is just one example of how far we can simplify, which I believe should extend pretty far.
"your mind is a pattern instantiated in matter"
I agree that people shouldn't assume that just neurons/connections are enough, but I doubt that is a strongly held belief; nor is it a required sub-belief of computationalism.
You assume too much about Bensinger's reply when he didn't respond, especially as he was responding to a subargument in the whole chain.
As well, the quoted sentence by Herd is very general — allowing both the neuron connections and molecular behavior.
(There's also the fact that people often handwave over the specifics of what part of the brain you're extracting, because they're talking about the general idea through some specific example that people often think about. Such as a worm's neurons.)
For example, for two calculators, wouldn't you agree with a description of them as having the same 'pattern' even if all the atoms aren't in the same position relative to a table? You agree-reacted on one of dirk's comments:
https://www.lesswrong.com/posts/zPM5r3RjossttDrpw/when-is-a-mind-me?commentId=wziGLYTwM4Nb9gd6E I disagree that your mind is "a pattern instantiated in matter." Your mind is the matter. It's precisely the assumption that the mind is separable from the matter that I would characterize as non-physicalist.
Would the idea that a calculator has some pattern, some logical rules that it is implementing via matter, thus be non-physicalist about calculators? A brain follows the rules of reality, with many implications about how certain molecules constrain movement, how these neuron spikes cause hunger, etcetera. There is a logical/computational core to this that can be reimplemented.
The basic concept of computation at issue here is a feature of the map you could use to approximate reality (i.e., the territory) . It is merely part of a mathematical model that, as I've described in response to Ruby earlier, represents a very lossy compression of the underlying physical substrate
Why shouldn't we decide based on a model/category? Just as there are presumably edge-cases to what counts as a 'human' or 'person'. There very well may be strange setups where we can't reasonably determine to our liking whether we consider it computably implementing a person, a chihuahua, or the weather of Jupiter.
We could try to develop a theory of identity down to the last atom, still operating on a model but at least an extremely specific model, which would presumably force us to narrow in on confusing edge-cases. This would be interesting to do once we have the technology, though I expect there to be edge-cases no matter what, where our values aren't perfectly defined, which might mean preserving option value.
I'm also skeptical that most methods present a very lossy compression even if we assume classical circuits. Why would it? (Or, if you're going to raise the idea of only getting some specific sub-class of neuron information, then sure, that probably isn't enough, but I don't care about that)
From this angle where you believe that computation is not fundamental or entirely well-defined, you can simplify the computationalist proposal as "merely" applying in a very large class of cases. Teleporters have no effect on personal identity due to similarity in atomic makeup up to some small allowance for noise (whether simple noise, or because we can't exactly copy all the quantum parts; I don't care if my lip atoms are slightly adjusted). Cloning does not have a strictly defined "you" and "not-you". Awakening from cryogenics counts as a continuation of you. A simulation implementing all the atomic interactions of your mind is very very likely to be you, and a simulation that has simplified many aspects of that down is also still very likely to be you.
Though there are definitely people who believe that the universe can fundamentally be considered computation, which I find plausible, especially due to a lack of other lenses that aren't just "reality is". Your objection does not work against them without further argumentation.
Going back to the calculator example, you would need to provide argumentation for why the essential parts of the brain can't be implemented computationally.
(You link https://www.lesswrong.com/posts/zPM5r3RjossttDrpw/when-is-a-mind-me#5DqgcLuuTobiKqZAe)
What I value about me is the pattern of beliefs, memories, and values.
The attempted mind-reading of others is (justifiably) seen as rude in conversations over the Internet, but I must nonetheless express very serious skepticism about this claim, as it's currently written. For one, I do not believe that "beliefs" and "values" ultimately make sense as distinct, coherent concepts that carve reality at the joints. This topic has been talked about before on LW a number of times, but I still fully endorse Charlie Steiner's distillation of it in his excellently-written Reducing Goodhart sequence
Concepts can still be useful categorizations even if they aren't hard and fast. Beliefs are often distinct from values in humans. They are vague and intertwine with each other, a belief forming a piece of value that doesn't fade away even once the belief is proven false, a value endorsing a belief for no reason... They are still not one and the same, and they do behave differently. I also don't see how this is relevant to the statement. I agree with what they said. I value my pattern of beliefs, memories, and values. I don't care about my specific spatial position for identity (except insofar as I don't want to be in a star), or if I'm solely in baseline reality. Your objections to CEV also seem to me to follow a similar pattern as this, where you go "this does not have a perfect foundational backing" to thus imply "it has no meaning, and there's nothing to be said about it". The consideration of path-dependency in CEV has been raised before, and it is an area that would be great to understand more. My values would say that I meta-value my beliefs to be closer to the truth. There are ambiguities in this area. What about beliefs affecting my values? There's more uncertainty in that region of what I wish to allow.
In any case, the rather abstract "beliefs, memories and values" you solely purport to value fit the category of professed ego-syntonic morals much more so than the category of what actually motivates and generates human behavior, as Steven Byrnes explained in an expectedly outstanding way:
I'd need a whole extra long comment to respond to all the various other parts of your comment chain, such as indexicality, or the part which goes along the lines of saying "professed values are not real". That seems decently false, overly cynical, and also not what Byrnes' linked post tries to imply. I'd say professed values are often what you tend towards, but that your basic drives are often strong enough to stall out methods like "spend long hours solving some problem" due to many small opportunities. If you were given a big button to do something you profess to value, then you'd press it.
This also raises the question of: Why should I care that the human motivational system has certain basic drives driving it forward? Give me a big button and I'd alter my basic drives to be more in-line with my professed values. The basic drives are short-sighted. (Well, I'd prefer to wait until superintelligent help, because there's lots of ways to mess that up) Of course, that I don't have the big button has practical implications, but I'm primarily arguing against the cynical denial of having any other values than what these basic drives allow.
(I don't entirely like my comment, it could be better. I'd suggest breaking the parent question-post up into a dozen smaller questions if you want discussion, as the many facets could have long comments dedicated to each. Which is part of why there's no single post! You're touching on everything from theory of how the universe works, to how much the preferences we say are real, to whether our models of reality are useful enough for theories of identity, indexicality, whether it makes sense to talk about a logical pattern, etc. Then there's things like andesolde's posts that you cite, but I'm not sure I rely on, where I'd have various objections to their idea of reality as subjective-first. I'll probably find more I dislike about my comment, or realize that I could have worded or explained better once I come around to reading back over it with fresh eyes.)
it fits with that definition
Ah, I rewrote my comment a few times and lost what I was referencing. I originally was referencing the geometric meaning (as an alternate to your statistical definition), two vectors at a right angle from each other.
But the statistical understanding works from what I can tell? You have your initial space with extreme uncertainty, and the orthogonality thesis simply states that (intelligence, goals) are not related: you can pair some intelligence with any goal. They are independent of each other at this most basic level. This is the orthogonality thesis. Then, in practice, you condition your probability distribution over that space with your more specific knowledge about what minds will be created, and how they'll be created. You can consider this as giving you a new space, moving probability around. As an absurd example: if height/weight of creatures were uncorrelated in principle, but then we update on "this is an athletic human", then in that new distribution they are correlated! This is what I was trying to get at with my R^2 example, but apologies that I was unclear since I was still coming at it from a frame of normal geometry. (Think, each axis is an independent normal distribution but then you condition on some knowledge that restricts them such that they become correlated)
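To make the "independent axes, then condition" picture concrete, here is a minimal sketch. Two coordinates are drawn as independent standard normals, and then we condition on an arbitrary restriction (here `abs(x - y) < 0.5`, chosen purely for illustration and not meant to model anything about minds). The unconditioned sample is roughly uncorrelated, while the conditioned sample is strongly correlated.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Prior: each axis is an independent standard normal.
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]
corr_all = pearson([p[0] for p in points], [p[1] for p in points])

# Condition on extra knowledge that restricts the space (hypothetical constraint).
kept = [(x, y) for x, y in points if abs(x - y) < 0.5]
corr_kept = pearson([p[0] for p in kept], [p[1] for p in kept])

print(f"unconditioned: {corr_all:.3f}, conditioned: {corr_kept:.3f}")
```

The axes being independent is a statement about the prior; the correlation only appears once probability mass is moved around by the condition.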
I agree that it is an informal argument and that pinning it down to very detailed specifics isn't necessary or helpful at this low-level, I'm merely attempting to explain why orthogonality works. It is a statement about the basic state of minds before we consider details, and they are orthogonal there; because it is an argumentative response to assumptions about "smart -> not dumb goals".
I'm skeptical of the naming being bad, it fits with that definition and the common understanding of the word. The Orthogonality Thesis is saying that the two qualities of intelligence and goals/values are not necessarily related, which may seem trivial nowadays but there used to be plenty of people going "if the AI becomes smart, even if it is weird, it will be moral towards humans!" through reasoning of the form "smart -> not dumb goals like paperclips". There's structure imposed on what minds actually get created, based on what architectures, what humans train the AI on, etc. Just as two vectors can be orthogonal in R^2 while the actual points you plot in the space are correlated.
I agree, though I haven't seen many proposing that, but also see So8res' Decision theory does not imply that we get to have nice things, though this is coming from the opposite direction (with the start being about people invalidly assuming too much out of LDT cooperation)
Though for our morals, I do think there's an active question of which pieces we feel better replacing with the more formal understanding, because there isn't a sharp distinction between our utility function and our decision theory. Some values trump others when given better tools. Though I agree that replacing all the altruism components is many steps farther than is the best solution in that regard.
Suffering is already on most readers' minds, as it is the central advocating reason behind euthanasia, and for good reason. I agree that policies which cause or ignore suffering, when they could very well avoid such with more work, are unfortunately common. However, those are often not utilitarian policies; and similarly many objections to various implementations of utilitarianism and even classic "do what seems the obviously right action" are that they ignore significant second-order effects. Policies that don't quantify what unfortunate incentives they give are common, and often originators of much suffering. What form society/culture is allowed/encouraged to take shapes itself further for decades to come, and so can be a very significant cost to many people if we roll straight ahead like in the possible scenario you originally quoted.
Suffering is not directly available to external quantification, but that holds true for ~all pieces of what humans value/disvalue, like happiness, experiencing new things, etcetera. We can quantify these, even if it is nontrivial. None of what I said is obviating suffering, but rather comparing it to other costs and pieces of information that make euthanasia less valuable (like advancing medical technology).
This doesn't engage with the significant downsides of such a policy that Zvi mentions. There are definite questions about the cost/benefits to allowing euthanasia, even though we wish to allow it, especially when we as a society are young in our ability to handle it. Glossing over this as if the only significant feature were 'torturing people' ignores:
- the very significant costs of people dying, which is compounded by the question of what equilibrium the mental/social availability of euthanasia is like
- the typical LessWrong beliefs about how good technology will get in the coming years/decades. Once we have a better understanding of humans, massively improving whatever is causing them to suffer whether through medical, social, or other means, becomes more and more actionable
- what the actual distribution of suffering is, I expect most are not at the level we/I would call torture even though it is very unpleasant (there's a meaningful difference between suicidally depressed and someone who has a disease that causes them pain every waking moment, and variations within those)
Being allowed to die is an important choice to let people make, but it does have to be a considered look at how much harm such an option being easily available causes. If it is disputed how likely society is to end up in a bad equilibrium like the post describes, then that's notable, but it would be good to see argument for/against instead.
(Edit: I don't entirely like my reply, but I think it is important to push back against trivial rounding off of important issues. Especially on LW.)
Any opinions on how it compares to Fun Theory? (Though that's less about all of utopia, it is still a significant part)
I think that is part of it, but a lot of the problem is just humans being bad at coordination. Like the government doing regulations. If we had an idealized free market society, then the way to get your views across would 'just' be to sign up for a filter (etc.) that down-weights buying from said company based on your views. Then they have more of an incentive to alter their behavior. But it is hard to manage that. There's a lot of friction to doing anything like that, much of it natural. Thus government serves as our essential way to coordinate on important enough issues, but of course government has a lot of problems in accurately throwing its weight around. Companies that are top down are a lot easier to coordinate behavior. As well, you have a smaller problem than an entire government would have in trying to plan your internal economy.
I definitely agree that it doesn't give reason to support a human-like algorithm, I was focusing in on the part about adding numbers reliably.
I believe a significant chunk of the issue with numbers is that the tokenization is bad (not per-digit), which is the same underlying cause for being bad at spelling. So then the model has to memorize from limited examples what actual digits make up the number. The xVal paper encodes the numbers as literal numbers, which helps. Also Teaching Arithmetic to Small Transformers which I forget somewhat, but one of the things they do is per-digit tokenization and reversing the order (because that works better with forward generation). (I don't know if anyone has applied methods in this vein to a larger model than those relatively small ones, I think the second has 124m)
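A rough sketch of why per-digit tokens in reversed order suit left-to-right generation: if the least significant digit comes first, each output digit depends only on digits already seen plus a single carry. This is an illustration of the general idea and is not claimed to be the exact data format used in either paper; the function names are made up for the example.

```python
from typing import List


def to_reversed_digits(n: int) -> List[str]:
    """Per-digit tokens for a non-negative integer, least significant digit first."""
    return list(str(n))[::-1]


def add_reversed(a_digits: List[str], b_digits: List[str]) -> List[str]:
    """Add two reversed digit sequences, emitting output digits one at a time."""
    out = []
    carry = 0
    for i in range(max(len(a_digits), len(b_digits))):
        da = int(a_digits[i]) if i < len(a_digits) else 0
        db = int(b_digits[i]) if i < len(b_digits) else 0
        total = da + db + carry
        out.append(str(total % 10))  # this digit is final as soon as it is emitted
        carry = total // 10          # only one small piece of state is carried forward
    if carry:
        out.append(str(carry))
    return out


print(add_reversed(to_reversed_digits(999), to_reversed_digits(1)))
```

In the usual most-significant-first order, the first output digit of 999 + 1 cannot be known until the carry from the last digit is resolved, which is the awkward part for forward generation.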
Though I agree that there's a bunch of errors LLMs make that are hard for them to avoid due to no easy temporary scratchpad-like method.
Yes, in principle you can get information on scheming likelihood if you get such an AI (that is also weak enough that it can't just scheme its way out of your testing apparatus). I do think making the threat credible is hard if we loosely extrapolate costs out: burning a trained up model is not cheap. The cost depends on how high you think prices for training/inference will fall in the future, and how big/advanced a model you're thinking of. Though I do think you can get deceptiveness out of weaker models than that, though they're also going to be less capable in general.
For weak-but-still-smartish models just trained to pursue a longterm goal, like a literal paperclipper, I'd expect scheming to be visible especially with some poking. If these models didn't show any sign of scheming, that'd be an interesting update! When training against deception/scheming/misalignment, because you want something at least loosely aligned out, you run into the problem that the property you want to know: "Is it scheming?" / "Is it deceiving us?" / "Is it manipulating us?" / "Would it do any of those things", is only available through the observable signal of you realizing certain outputs are misaligned. As you train more and more against observable scheming/deception/misalignment, you're simultaneously training away your signal about how much scheming the agent is doing internally. Pretty sure there's at least one post on this problem. As a very reduced example, if you trained the model on variants of the 'we are going to shut you down' problem (that you try to make it believe) to give the response "okay & no actions" then that provides only a bit of evidence about how much scheming is done in the scenario where you had not done that extra training, and about how much scheming is occurring typically. This shows your ability to make it output X in a Y scenario.
That is, installing heuristics on your model can be done. The question then is how far your various alignment training techniques generalize. Does it end up an agent which has adopted a pointer to human-values, and now you're completely fine and free to press scale? Does it gain more entangled heuristics about how it should behave, limiting to a friendly front face & actions when directly dealing with humans but not limited in other areas? Has it adopted heuristics that act sort of corrigible to humans in many scenarios but that would get rid of most of those upon sufficient reflection? (Niceness is unnatural, more general than just niceness) (I think your post makes it sound like the agent is already coherent, when it isn't necessarily. It can be operating for a long while on heuristics that it will refine given enough of a push.)
Then there's the big question of "Does this approach generalize as we scale".
I'd suggest Deep Deceptiveness for an illustration that 'deception' isn't a category that needs to be explicitly thought of as deception, but that you should expect it from smart enough agents. In my opinion, the post generalizes to alignment techniques, there are just more vagaries of how much certain values preserve themselves. (In general, So8res posts are pretty good, and I agree with ~most of them)
(For sufficiently smart AGI, I expect you run into an argument of the next AGI you train predictably bidding higher than you in the direction of lying still or plausibly this just being good game theory even without the direct acausal trade, but your argument is seemingly focused on a simpler case of weaker planning agents)
So I think you overstate how much evidence you can extract from this.
Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then provide us confidence about the safety of future systems by proxy.
It would show that this AI system in a typical problem-setup when aligned with whatever techniques are available will produce the answer the humans want to hear, which provides evidence for being able to limit the model in this scenario. There are still various problems/questions: 'your alignment methods instilled a bunch of heuristics about obeying humans even if you did not specifically train for this situation', game theory it knows or mimics, how strong the guarantees are that this gives us on training a new model with the same arch because you had to shut it down for your threat, how well it holds under scaling, how well it holds when you do things similar to making it work with many copies of itself, etcetera.
I still think this would be a good test to do (though I think a lot of casual attempts will just be poorly done), but I don't see it as strongly definitive.
Here are the archive.org links: reference table, chart
https://www.mikescher.com/blog/29/Project_Lawful_ebook is I believe the current best one, after a quick search on the Eliezerfic discord.
Minor: the link for Zvi's immoral mazes has an extra 'm' at the start of the part of the path ('zvi/mimmoral_mazes/')
Because it serves as a good example, simply put. It gets the idea clear across about what it means, even if there are certainly complexities in comparing evolution to the output of an SGD-trained neural network.
It predicts learning correlates of the reward signal that break apart outside of the typical environment.
When you look at the actual process for how we actually start to like ice-cream -- namely, we eat it, and then we get a reward, and that's why we like it -- then the world looks a lot less hostile, and misalignment a lot less likely.
Yes, that's why we like it, and that is a way we're misaligned with evolution (in the 'do things that end up with vast quantities of our genes everywhere' sense). Our taste buds react to it, and they were selected for activating on foods which typically contained useful nutrients, and now they don't in reality since ice-cream is probably not good for you. I'm not sure what this example is gesturing at? It sounds like a classic issue of having a reward function ('reproduction') that ends up with an approximation ('your tastebuds') that works pretty well in your 'training environment' but diverges in wacky ways outside of that.
What I'm inferring from 'evolution is only selecting hyperparameters' is that SGD has fewer layers of indirection between it and the actual operation of the mind compared to evolution (which has to select over the genome which unfolds into the mind). Sure, that gives some reason to believe it will be easier to direct it in some ways - though I think there's still active room for issues of in-life learning, I don't really agree with Quintin's idea that the cultural/knowledge-transfer boom with humans has happened thus AI won't get anything like it - but even if we have more direct optimization I don't see that as strongly making misalignment less likely? It does make it somewhat less likely, though it still has many large issues for deciding what reward signals to use.
I still expect correlates of the true objective to be learned. This happens even in in-life learning for humans, through sometimes associating an unrelated thing with getting a good thing, and not just as a matter of false beliefs. Like, as a simple example, learning to appreciate rainy days because you and your family sat around the fire and had fun, such that you later in life prefer rainy days even without any of that.
Evolution doesn't directly grow minds, but it does directly select for the pieces that grow minds, and has been doing that for quite some time. There's a reason why it didn't select for tastebuds that gave a reward signal strictly when some other bacteria in the body reported that they would benefit from it: that's more complex (to select for), opens more room for 'bad reporting', may have problems with shorter gut bacteria lifetimes(?), and a simpler tastebud solution captured most of what it needed! The way he's using the example of evolution is captured entirely by that, quite directly, and I don't find it objectionable.
Is this a prediction that a cyclic learning rate -- that goes up and down -- will work out better than a decreasing one? If so, that seems false, as far as I know.
https://www.youtube.com/watch?v=GM6XPEQbkS4 (talk) / https://arxiv.org/abs/2307.06324 prove faster convergence with a periodic learning rate. On a specific 'nicer' space than reality, and they're (I believe from what I remember) comparing to a good bound with a constant stepsize of 1. So it may be one of those papers that applies in theory but not often in practice, but I think it is somewhat indicative.
I agree with others to a large degree about the framing/tone/specific-words not being great, though I agree with a lot of the post itself, but really that's what this whole post is about: that dressing up your words and saying partial in-the-middle positions can harm the environment of discussion. That saying what you truly believe then lets you argue down from that, rather than doing the arguing down against yourself - and implicitly against all the other people who hold a similar ideal belief as you. I've noticed similar facets of what the post gestures at, where people pre-select the weaker solutions to the problem as their proposals because they believe that the full version would not be accepted. This is often even true, I do think that completely pausing AI would be hard. But I also think it is counterproductive to start at the weaker more-likely-to-be-satisfiable position, as that gives room to be pushed further down. It also means that the overall presence is on that weaker position, rather than the stronger ideal one, which can make it harder to step towards the ideal.
We could quibble about whether to call it lying, I think the term should be split up into a bunch of different words, but it is obviously downplaying. Potentially for good reason, but I agree with the post that people too often ignore the harms of doing preemptive downplaying of risks. Part of this is me being more skeptical about the weaker proposals than others; obviously if you think RSPs have good chances for decreasing X-risk and/or will serve as a great jumping-off point for better legislation, then the amount of downplaying to settle on them is less of a problem.
Along with what Raemon said, though I expect us to probably grow far beyond any Earth species eventually, if we're characterizing evolution as having a reasonable utility function then I think there's the issue of other possibilities that would be more preferable.
Like, evolution would-if-it-could choose humans to be far more focused on reproducing, and we would expect that if we didn't put in counter-effort that our partially-learned approximations ('sex enjoyable', 'having family is good', etc.) would get increasingly tuned for the common environments.
Similarly, if we end up with an almost-aligned AGI that has some value which extends to 'filling the universe with as many squiggles as possible' because that value doesn't fall off quickly, but it has another more easily saturated 'caring for humans' then we end up with some resulting tradeoff along there: (for example) a dozen solar systems with a proper utopia set up.
This is better than the case where we don't exist, similar to how evolution 'prefers' humans compared to no life at all. It is also maybe preferable to the worlds where we lock down enough to never build AGI, similar to how evolution prefers humans reproducing across the stars to never spreading. It isn't the most desirable option, though. Ideally, we get everything, and evolution would prefer space algae to reproduce across the cosmos.
There's also room for uncertainty in there, where even if we get the agent loosely aligned internally (which is still hard...) then it can have a lot of room between 'nothing' to 'planet' to 'entirety of the available universe' to give us. Similar to how humans have a lot of room between 'negative utilitarianism' to 'basically no reproduction past some point' to 'reproduce all the time' to choose from / end up in. There's also the perturbations of that, where we don't get a full utopia from a partially-aligned AGI, or where we design new people from the ground up rather than them being notably genetically related to anyone.
So this is a definite mismatch - even if we limit ourselves to reasonable bounded implementations that could fit in a human brain. It isn't as bad a mismatch as it could have been, since it seems like we're on track to 'some amount of reproduction for a long period of time -> lots of people', but it still seems to be a mismatch to me.
I assume what you're going for with your conflation of the two decisions is this, though you aren't entirely clear on what you mean:
- Some agent starts with some (potentially broken in various manners, like bad heuristics or being unable to consider certain impacts) decision theory, because there's no magical a priori decision algorithm
- So the agent is using that DT to decide how to make better decisions that get more of what it wants
- CDT would modify into Son-of-CDT typically at this step
- The agent is deciding whether it should use FDT.
- It is 'good enough' that it can predict that if it decides to just completely replace itself with FDT, it will get punched by your agent, or it will have to pay to avoid being punched.
- So it doesn't completely swap out to FDT, even if it is strictly better in all problems that aren't dependent on your decision theory
- But it can still follow FDT to generate actions it should take, which won't get it punished by you?
Aside: I'm not sure there's a strong definite boundary between 'swapping to FDT' (your 'use FDT') and taking FDT's outputs to get actions that you should take. Ex: If I keep my original decision loop but it just consistently outputs 'FDT is best to use', is that swapping to FDT according to you?
Does if (true) { FDT() } else { CDT() } count as FDT or not?
(Obviously you can construct a class of agents which have different levels that they consider this at, though)
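A small Python sketch of the wrapper question above. The policies here are placeholder lookup tables, not real implementations of FDT or CDT, and the problem names and the two 'punisher' checks are hypothetical. The point is only that a predictor which checks what the agent literally is and one which checks what the agent does can disagree about the wrapper.

```python
from typing import Callable

Agent = Callable[[str], str]  # maps a problem name to an action

def fdt(problem: str) -> str:
    # Placeholder stand-in for "whatever FDT would output" (illustrative only).
    return {"newcomb": "one-box", "blackmail": "refuse"}.get(problem, "default")

def cdt(problem: str) -> str:
    # Placeholder stand-in for "whatever CDT would output" (illustrative only).
    return {"newcomb": "two-box", "blackmail": "pay"}.get(problem, "default")

def wrapper(problem: str) -> str:
    # The agent from the comment: if (true) { FDT() } else { CDT() }
    if True:
        return fdt(problem)
    else:
        return cdt(problem)

PROBLEMS = ["newcomb", "blackmail", "something-else"]

def punishes_by_identity(agent: Agent) -> bool:
    # Hypothetical punisher that only triggers on the exact FDT implementation.
    return agent is fdt

def punishes_by_behaviour(agent: Agent) -> bool:
    # Hypothetical punisher that triggers on anything acting like FDT
    # across the problems it checks.
    return all(agent(p) == fdt(p) for p in PROBLEMS)

wrapper_hit_by_identity = punishes_by_identity(wrapper)
wrapper_hit_by_behaviour = punishes_by_behaviour(wrapper)
```

Under these assumptions the wrapper escapes the identity-based punisher but not the behaviour-based one, which is the sense in which 'swapping to FDT' and 'following FDT's outputs' have no sharp boundary between them.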
There's a Daoist answer: Don't legibly and universally precommit to a decision theory.
But you're whatever agent you are. You are automatically committed to whatever decision theory you implement. I can construct a similar scenario for any DT.
'I value punishing agents that swap themselves to being DecisionTheory.'
Or just 'I value punishing agents that use DecisionTheory.'
Am I misunderstanding what you mean?
How do you avoid legibly being committed to a decision theory, when that's how you decide to take actions in the first place? Inject a bunch of randomness so others can't analyze your algorithm? Make your internals absurdly intricate to foil most predictors, and only expose a legible decision making part in certain problems?
FDT, I believe, would acquire uncertainty about its algorithm if it expects that to actually be beneficial. It isn't universally-glomarizing like your class of DaoistDTs, but I shouldn't commit to being illegible either.
I agree with the argument for not replacing your decision theory wholesale with one that does not actually get you the most utility (according to how your current decision theory makes decisions). However I still don't see how this exploits FDT.
Choosing FDT loses in the environment against you, so our thinking-agent doesn't choose to swap out to FDT - assuming it doesn't just eat the cost for all those future potential trades. It still takes actions as close to FDT as it can as far as I can tell.
I can still construct a symmetric agent which goes 'Oh you are keeping all that algorithmic cruft around shelling out to FDT when you just follow it always? Well I like punishing those kinds of agents.'
If the problem specifies that it is an FDT agent from the start, then yes FDT gets punished by your agent. And, how is that exploitable?
The original agent before it replaced itself with FDT shouldn't have done that, given full knowledge of the scenario it faced (only one decision forevermore, against an agent which punishes agents which only implement FDT), but that's just the problem statement?
The thing FDT disciples don't understand is that I'm happy to take the scenario where FDT agents don't cave to blackmail.
? That's the easy part. You are just describing an agent that likes messing over FDT, so it benefits you regardless of the FDT agent giving into blackmail or not. This encourages agents which are deciding what decision theory to self modify into (or make servant agents) to not use FDT for it, if they expect to get more utility by avoiding that.
If your original agent is replacing themselves as a threat to FDT, because they want FDT to pay up, then FDT rightly ignores it. Thus the original agent, which just wants paperclips or whatever, has no reason to threaten FDT.
If we postulate a different scenario where your original agent literally terminally values messing over FDT, then FDT would pay up (if FDT actually believes it isn't a threat). Similarly, if part of your values has you valuing turning metal into paperclips and I value metal being anything-but-paperclips, I/FDT would pay you to avoid turning metal into paperclips. If you had different values - even opposite ones along various axes - then FDT just trades with you.
However FDT tries to close off the incentives for strategic alterations of values, even by proxy, to threaten.
So I see this as a non-issue. I'm not sure I see the pathological case of the problem statement: an agent has utility function of 'Do worst possible action to agents who exactly implement (Specific Decision Theory)' as a problem either. You can construct an instance for any decision theory. Do you have a specific idea how you would get past this? FDT would obviously modify itself if it can use that to get around the detection (and the results are important enough to not just eat the cost).
Utility functions are shift/scale invariant.
If you have and , then if we shift it by some constant to get a new utility function: and then we can still get the same result.
If we look at the expected utility, then we get:
Certainty of :
50% chance of , 50% chance of nothing:
- (so you are indifferent between certainty of and a 50% chance of by )
I think this might be where you got confused? Now the expected values are different for any nonzero !
The issue is that it is ignoring the implicit zero. The real second equation is:
- + 0 = 1$
Which results in the same preference ordering.
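A worked version of the shift argument, as a sketch with assumed numbers. The values $U(A) = 1$, $U(B) = 2$, $U(\text{nothing}) = 0$ and the shift constant $c$ are illustrative choices for this example, not necessarily the ones used in the comment above.

```latex
\begin{align*}
  U'(x) &= U(x) + c
    && \text{(shifted utility function)} \\
  \mathbb{E}[U' \mid \text{certainty of } A]
    &= 1 + c \\
  \tfrac{1}{2}\,U'(B)
    &= \tfrac{1}{2}(2 + c) = 1 + \tfrac{1}{2}c
    && \text{(mistake: the ``nothing'' branch is dropped)} \\
  \tfrac{1}{2}\,U'(B) + \tfrac{1}{2}\,U'(\text{nothing})
    &= \tfrac{1}{2}(2 + c) + \tfrac{1}{2}(0 + c) = 1 + c
    && \text{(the implicit zero is shifted too)}
\end{align*}
```

With the 'nothing' outcome shifted along with everything else, both options come out to $1 + c$, so the indifference under $U$ is preserved under $U'$ for any $c$.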
Just 3 with a dash of 1?
I don't understand the specific appeal of complete reproductive freedom. It is desirable to have that freedom, in the same way it is desirable to be allowed to do whatever I feel like doing. However, that more general heading of arbitrary freedom has the answer of 'you do have to draw lines somewhere'. In a good future, I'm not allowed to harm a person (nonconsensually), and I can't requisition all matter in the available universe for my personal projects without ~enough of the population endorsing it, and I can't reproduce / construct arbitrary amounts and arbitrary new people. (Constructing arbitrary people obviously has moral issues too, so it has cutoff lines at both the 'moral issues' and 'resource limitations even at the scale')
I think economic freedom looks significantly different in a post-aligned-AGI world than it does now. Like, there are still some concepts of trade going on, but I expect them to often be running in the background.
I'm not sure why you think the 'default trajectory' is 1+2. Aligned AGI seems to most likely go for some mix of 1+3, while pointing at the wider/more specific cause area of 'what humans want'. A paperclipper just says null to all of those, because it isn't giving humans the right to create new people or any economic freedom unless they manage to be in a position to actually-trade and have something worth offering.
I don't think that what we want to align it to is that pertinent a question at this stage? In the specifics, that is, obviously human values in some manner.
I expect that we want to align it via some process that lets it figure our values out without needing to decide on much of it now, ala CEV.
Having a good theory of human values beforehand is useful for starting down a good track and verifying it, of course.
I think the generalized problem of 'figure out how to make a process that is corrigible and learns our values in some form that is robust' is easier than figuring out a decent specification of our values.
(Though simpler bounded-task agents seem likely before we manage that, so my answer to the overall question is 'how do we make approximately corrigible powerful bounded-task agents to get to a position where humanity can safely focus on producing aligned AGI')
I'm also not sure that I consider an astronomical suffering outcome (as it's described in the paper) to be bad by itself.
If you have (absurd amount of people) and they have some amount of suffering (ex: it shakes out that humans prefer some degree of negative-reinforcement as possible outcomes, so it remains) then that can be more suffering in terms of magnitude, but has the benefits of being more diffuse (people aren't broken by a short-term large amount of suffering) and with less individual extremes of suffering.
Obviously it would be bad to have a world that has astronomical suffering that is then concentrated on a large amount of people, but that's why I think - a naive application of - astronomical suffering is incorrect because it ignores diffuse experiences, relative experiences (like, if we have 50% of people with notably bad suffering today, then your large future civilization with only 0.01% of people with notably bad suffering can still swamp that number, though the article mentions this I believe), and more minor suffering adding up over long periods of time.
(I think some of this comes from talking about things in terms of suffering versus happiness rather than negative utility versus positive utility? Where zero is defined as 'universe filled with things we don't care about'. Like, you can have astronomical suffering that isn't that much negative utility because it is diffuse / lower in a relative sense / less extreme, but 'everyone is having a terrible time in this dystopia' has astronomical suffering and high negative utility)
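To put rough numbers on the 'relative experiences' point, here is a back-of-the-envelope sketch. The 50% and 0.01% figures are the example rates from the comment; the current population of about $8 \times 10^{9}$ is a rounded assumption, and $N$ is the hypothetical future population.

```latex
\begin{align*}
  \text{today:}\quad & 0.5 \times 8 \times 10^{9} = 4 \times 10^{9}
    \text{ people with notably bad suffering} \\
  \text{future:}\quad & 10^{-4} \times N > 4 \times 10^{9}
    \iff N > 4 \times 10^{13}
\end{align*}
```

So under these assumptions a civilization of more than about forty trillion people would have more people suffering in absolute terms, despite a rate five thousand times lower.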
I primarily mentioned it because I think people base their 'what is the S-risk outcome' on basically antialigned AGI. The post has 'AI hell' in the title and uses comparisons between extreme suffering versus extreme bliss, calls s-risks more important than alignment (which I think makes sense to a reasonable degree if antialigned s-risk is likely or a sizable portion of weaker dystopias are likely, but I don't think it makes sense given that antialigned is very unlikely and that I consider weak dystopias to also be overall not likely). The extrema argument is why I don't think that weak dystopias are likely, because I think that - unless we succeed at alignment to a notable degree - the extremes of whatever values shake out are not something that keeps humans around for very long. So I don't expect weaker dystopias to occur either.
I expect that most AIs aren't going to value making a notable deliberate AI hell, whether out of the lightcone or 5% of it or 0.01% of it. If we make an aligned-AGI and then some other AGI says 'I will simulate a bunch of humans in torment unless you give me a planet' then I expect that our aligned-AGI uses a decision-theory that doesn't give into dt-Threats and doesn't give in (and thus isn't threatened, because the other AGI gains nothing from actually simulating humans in that).
So, while I do expect that weak dystopias have a noticeable chance of occurring, I think it is significantly unlikely? It grows more likely we'll end up in a weak dystopia as alignment progresses. Like if we manage to get enough of a 'caring about humans specifically' (though I expect a lot of attempts like that to fall apart and have weird extremes when they're optimized over!), then that raises the chances of a weak dystopia.
However I also believe that alignment is roughly the way to solve these. To get notable progress on making AGIs avoid specific areas, I believe that requires more alignment progress than we have currently.
There is the class of problems where the unaligned AGI decides to simulate us to get more insight into humans, insight into evolved species, and insight into various other pieces of that. That would most likely be bad, but I expect it to not be a significant portion of computation and also not continually executed for (really long length of time). So I don't consider that to be a notable s-risk.
If I imagine trading extreme suffering for extreme bliss personally, I end up with ratios of 1 to 300 million – e.g., that I would accept a second of extreme suffering for ten years of extreme bliss. The ratio is highly unstable as I vary the scenarios, but the point is that I disvalue suffering many orders of magnitude more than I value bliss.
I also disvalue suffering significantly more than I value happiness (I think bliss is the wrong term to use here), but not to that level. My gut feeling wants to dispute those numbers as being practical, but I'll just take them as gesturing at the comparative feeling.
An idea that I've seen once, but not sure where, is: you can probably improve the amount of happiness you experience in a utopia by a large amount. Not through wireheading, which at least for me is undesirable, but 'simply' redesigning the human mind in a less hedonic-treadmill manner (while also not just cutting out boredom). I think the usual way of visualizing extreme dystopias as possible-futures has the issue that it is easy to compare them to the current state of humanity rather than an actual strong utopia. I expect that there's a good amount of mind redesign work, in the vein of some of the mind-design posts in Fun Theory but ramped up to superintelligence design+consideration capabilities, that would vastly increase the amount of possible happiness/Fun and make the tradeoff more balanced. I find it plausible that suffering is just easier to cause and more impactful even relative to strong-utopia-level enhanced-minds, but I believe this does change the calculus significantly. I might not take a 50/50 coin for strong dystopia/strong utopia, but I'd maybe take a 10/90 coin. Thankfully we aren't in that scenario, and have better odds.
In the language of Superintelligent AI is necessary for an amazing future but far from sufficient, I expect that the majority of possible s-risks are weak dystopias rather than strong dystopias. We're unlikely to succeed at alignment enough and then signflip it (like, I expect strong dystopia to be dominated by 'we succeed at alignment to an extreme degree' ^ 'our architecture is not resistant to signflips' ^ 'somehow the sign flips'). So, I think literal worst-case Hell and the immediate surrounding possibilities are negligible.
I expect the extrema of most AIs, even ones with attempted alignment patches, to be weird and unlikely to be of particular value to us. The way values resolve has a lot of room to maneuver early on, before it becomes a coherent agent, and I don't expect those to have extrema that are best fit by humans (see various of So8res's other posts). Thus, I think it is unlikely that we end up with a weak dystopia (at least for a long time, which is the s-risk) relative to x-risk.
That said, I do think there’s more overlap (in expectation) between minds produced by processes similar to biological evolution, than between evolved minds and (unaligned) ML-style minds. I expect more aliens to care about at least some things that we vaguely recognize, even if the correspondence is never exact.
On my models, it’s entirely possible that there just turns out to be ~no overlap between humans and aliens, because aliens turn out to be very alien. But “lots of overlap” is also very plausible. (Whereas I don’t think “lots of overlap” is plausible for humans and misaligned AGI.)
The Principles of Deep Learning Theory uses renormalization group flow in its analysis of deep learning, though it is applied at a 'lower level' than an AI's capabilities.
One minor thing I've noticed when thinking on interpretability is that of in-distribution versus out-of-distribution versus - what I call - out-of-representation data. I would assume this has been observed elsewhere, but I haven't seen it mentioned before.
In-distribution could be considered inputs in the same ''structure'' of what you trained the neural network on; out-of-distribution is exotic inputs, like an adversarially noisy image of a panda or a picture of a building for an animal-recognizer NN.
Out-of-representation would be when you have a neural network that takes in inputs of a certain form/encoding that restricts the representable values. However, the neural network can theoretically take anything in between, it just shouldn't ever.
The most obvious example would be if you had a NN that was trained on RGB pixels from images to classify them. Each pixel value is normalized in the range of . Out of representation here would be if you gave it a very 'fake' input of . All of the images when you give them to NN, whether noisy garbage or a typical image, would be properly normalized within that range. However, with direct access to the neural networks inputs, you give it out-of-representation values that aren't properly encoded at all.
I think this has some benefits for some types of interpretability, (though it is probably already paid attention to?), in that you can constrain the possible inputs when you consider the network. If you know the inputs to the network are always bounded in a certain range, or even just share a property like being positive, then you can constrain the intermediate neuron outputs. This would potentially help in ignoring out-of-representation behavior, such as some neurons only being a good approximation of a sine-wave for in-representation inputs.
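As a sketch of how a known input range constrains intermediate neuron outputs, here is simple interval propagation through one affine layer followed by a ReLU, in plain Python. The weights, bias, and the assumption that every input lies in [0, 1] are made up for illustration; this is one standard way to compute such bounds, not necessarily what an interpretability tool would use.

```python
def affine_bounds(weights, bias, lower, upper):
    """Bounds on W x + b when each input x[j] lies in [lower[j], upper[j]]."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                # A non-negative weight is minimised by the smallest input.
                lo += w * l
                hi += w * u
            else:
                # A negative weight is minimised by the largest input.
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

# Hypothetical layer: 3 inputs (e.g. one normalized pixel), 2 neurons.
W = [[1.0, -2.0, 0.5],
     [-1.0, 0.0, 3.0]]
b = [0.0, -1.0]

# In-representation assumption: every input is in [0, 1].
pre_lower, pre_upper = affine_bounds(W, b, [0.0] * 3, [1.0] * 3)
post_lower, post_upper = relu_bounds(pre_lower, pre_upper)
```

Any behavior a neuron shows only outside these intervals can then be set aside as out-of-representation, since no properly encoded input can reach it.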
I initially wrote a long comment discussing the post, but I rewrote it as a list-based version that tries to more efficiently parcel up the different objections/agreements/cruxes.
This list ended up basically just as long, but I feel it is better structured than my original intended comment.
(Section 1): How fast can humans develop novel technologies
- I believe you assume too much about the necessary time based on specific human discoveries.
- Some of your backing evidence just didn't have the right pressure at the time to go further (ex: submarines) which means that I think a more accurate estimate of the time interval would be finding the time that people started paying attention to the problem again (though for many things that's probably hard to find) and began deliberately working on/towards that issue.
- Though, while I think focusing on when they began deliberately working is more accurate, I think there's still a notable amount of noise and basic differences due to the difference in ability to focus of humans relative to AGI, the unity (relative to a company), and the large amount of existing data in the future
- Other technologies I would expect were 'put off' because they're also closely linked to the available technology at the time. It can be hard to do specific things if your Materials-science understanding simply isn't good enough.
- Then there's the obvious throttling at the number of people in the industry focusing on that issue, or even capable of focusing on that issue.
- As well, to assume thirty years means that you also assume that the AGI does not have the ability to provide more incentive to 'speed up'. If it needs to build a factory, then yes there are practical limitations on how fast the factory can be built, but obstructions like regulation and cost are likely easier to remove for an AGI than a normal company.
- Crux #1: How long it takes for human inventions to spread after being thought up / initially tested / etc.
- This is honestly the one that seems to be the primary generator for your 'decades' estimate, however I did not find it that compelling even if I accept the premise that an AGI would not be able to build nanotechnology (without building new factories to build the new tools it needs to actually perform it)
- Note: The other cruxes later on are probably more about how much the AI can speed up research (or already has access to), but this could probably include a specific crux related to that before this crux.
(Section 2): Unstoppable intellect meets the complexity of the universe
- I agree that there are likely eventual physical limits on intelligence and research results (though you likely hit practical expected ROI before that).
- There would be many low-hanging fruits which are significantly easier to grab with a combination of high compute + intelligence that we simply didn't/couldn't grab beforehand. (This would be affected by the lead time, if we had good math prover/explainer AIs for two decades before AGI then we'd have started to pick a lot of the significant ideas, but as the next part points out, having more of the research already available just helps you)
- I also think that the fact that we've gotten rid of many of the notable easier-to-reach pieces (ex: classical mechanics -> GR -> QM -> QFT) is actually a sign that things are easier now in terms of doing something. The AGI has a significantly larger amount of information about physics, human behavior, logic, etcetera, that it can use without having to build it completely from the ground up.
- If you (somehow) had an AGI appear in 1760 without much knowledge, then I'd expect that it would take many experiments and a lot of time to detail the nature of its reality. Far less than we took, but still a notable amount. This is the scenario where I can see it taking 80 years for the AGI to get set up, but even then I think that's more due to restrictions on readily available compute to expand into after self-modification than other constraints.
- However, we've picked out a lot of the high and low level models that work. Rather than building an understanding of atoms through careful experimentation procedures, it can assume that they exist and pretty much follow the rules it's been given.
- (maybe) Crux #2: Do we already have most of the knowledge needed to understand and/or build nanotechnology?
- I'm listing this as 'maybe' as I'm more notably uncertain about this than others.
- Does it just require the concentrated effort of a monolithic agent staring down at the problem and being willing to crunch a lot of calculations and physics simulators?
- Or does it require some very new understanding of how our physics works?
(Section 3): What does AGI want?
- Minor objection on the split of categories. I'd find it.. odd if we manage to make an AI that terminally values only 'kill all humans'.
- I'd expect more varying terminal values, with 'make humans not a threat at all' (through whatever means) as an instrumental goal
- I do think it is somewhat useful for your thought experiments later on, which try to make the point that even a 'YOLO AGI' would have a hard time having an effect
(Section 4): What does it take to make a pencil?
- I think this analogy ignores various issues
- Of course, we're talking about pencils, but the analogy is more about 'molecular-level 3d-printer' or 'factory technology needed to make molecular level printer' (or 'advanced protein synthesis machine')
- Making a handful of pencils if you really need them is a lot more efficient than setting up that entire system.
- Though, of course, if you're needing mass production levels of that object then yes you will need this sort of thing.
- Crux #3: How feasible is it to make small numbers of specialized technology?
- There are some scientific setups that are absolutely massive and require enormous amounts of funding; however, there are also those that, with the appropriate tools, you can set up in a home workshop. I highly doubt either of those is the latter, but I'd also be skeptical that they need to be the size of the LHC.
- Note: Crux #4 (about feasibility of being able to make nanotechnology with a sufficient understanding of it and with current day or near-future protein synthesis) is closely related, but it felt more natural to put that with AlphaFold.
(Section 5): YOLO AGI?
- I think your objection that they're all perfectly doable by humans in the present is lacking.
- By metaphor:
- While it is possible for someone to calculate a million digits of pi by hand, the difference between speed and overall capability is shocking.
- While it is possible for a monkey to kill all of its enemies, humans have a far easier time with modern weaponry, especially in terms of scale
- Your assumption that it would take decades for even just the scenarios you list (except perhaps the last two) seems wrong
- Unless you're predicating on the goal being literally wiping out every human, but then that's a problem with the model simplification of YOLO AGI, where we model an extreme version of an AGI to talk about the more common, relatively less extreme versions that aren't hell-bent on killing us, just neutralizing us. (Which is what I assume is the intent of the section #3 split and of this section.)
- Then there's, of course, other scenarios that you can think up. For various levels of speed and sure lethality
- Ex: Relatively more mild memetic hazards (perhaps the level of 'kill your neighbor' memetic hazard is too hard to find) but still destructive can cause significant problems and gives room to be more obvious.
- Synthesize a food/drink/recreational-drug that is quite nice (and probably cheap) that also sterilizes you after a decade, to use in combination with other plans to make it even harder to bounce back if you don't manage to kill them in a decade
- To say that an AGI focused on killing will only "somewhat" increase the chances seems to underplay it severely.
- If I believed a nation state solidly wanted to do any of those on the list in order to kill humanity right now, then that would increase my worry significantly more than 'somewhat'
- For an AGI that:
- Isn't made up of humans who may value being alive, or are willing to put it off for a bit for more immediate rewards than their philosophy
- Can essentially be a one-being research organization
- Likely hides itself better
- then I would be even more worried.
(Section 6): But what about AlphaFold?
- This ignores how recent AlphaFold is.
- I would expect that it would improve notably over the next decade, given the evidence that it works being supplied to the market.
- (It would be like assuming GPT-1 would never improve. While there are certainly limits on how much it can improve, do we have evidence now that AlphaFold is even halfway to the practical limit?)
- This ignores possibility of more 'normal' simulation:
- While simulating physics accurately is highly computationally expensive, I don't find it infeasible that
- AI before, or the AGI itself, will find some neat ways of specializing the problem to the specific class of problems that they're interested in (aka abstractions over the behavior of specific molecules, rather than accurately simulating them) that are just intractable for an unassisted human to find
- This also has benefits in that it is relatively more well understood, which makes it likely easier to model for errors than AlphaFold (though the difference depends on how far we/the-AGI get with AI interpretability)
- The AI can get access to relatively large amounts of compute when it needs it.
- I expect that it can make a good amount of progress in theory before it needs to do detailed physics implementations to test its ideas.
- I also expect this to only grow over time, unless it takes actions to harshly restrict compute to prevent rivals
- I'm very skeptical of the claim that it would need decades of lab experiments to fill in the gaps in our understanding of proteins.
- If the methods for predicting proteins get only to twice as good as AlphaFold, then the AGI would specifically design to avoid hard-to-predict proteins
- My argument here is primarily that you can do a tradeoff of making your design more complex-in-terms-of-lots-of-basic-pieces-rather-than-a-mostly-single-whole/large in order to get better predictive accuracy.
- Crux #4: How good can technology to simulate physics (and/or isolated to a specific part of physics, like protein interactions) practically get?
- (Specifically practical in terms of ROI, maybe we can only completely crack protein folding with planet sized computers, but that isn't feasible for us or the AGI on the timescales we're talking about)
- Are we near the limit already? Even before we gain a deeper understanding of how networks work and how to improve their efficiency? Even before powerful AI/AGI are applied to the issue?
(Section 7): What if AGI settles for a robot army?
- 'The robots are running on pre-programmed runs in a human-designed course and are not capable of navigating through unknown terrain'
- Are they actually pre-programmed in the sense that they flashed the rom (or probably uploaded onto the host OS) the specific steps, or is it "Go from point A to point B along this path" where it then dodges obstacles?
- As well, this doesn't stop it from being a body to just directly control.
- We'll also have further notable advancements in robots that can navigate appropriately by the time AGI comes about
- As well as increased numbers, though this depends on how popular/useful they are. I don't expect a 'Mr. Handy' Fallout-style butler in every home, but I'd expect robots from places like Boston Dynamics to start filtering out more and more to organizations that want them over the next decade or two.
- Factories already exist (likely now and almost certainly in the future), which dodges the issue of having to design + build them. The AGI buys Boston Dynamics, manipulates it, or just buys robots, and then has robots that it doesn't even have to hack remotely but can aggressively tear down if it wants. Though, of course, this would be with whatever the equivalent(s) are at the time.
- I think you overestimate how hard it would be to control robots remotely.
- As for hosting a clone of the AGI, I do think this is unlikely, in part due to feasibility but also because there are better methods.
- Though, I would note that I believe it makes sense to expect that we can reduce model sizes significantly (either during training or afterwards) with the help of better models of how networks work, and that with AI help we could reduce them further.
- Though, while this may mean that in the future it might be feasible to run GPT-3 on a normal laptop, that doesn't mean that you can fit the AGI on a robot. Perhaps you could fit a seed AGI, but then you lose a lot of data. Anyway.
- I'd be surprised if the battery usage couldn't be improved significantly, whether through better battery designs over the next two decades or more efficient designs or larger bodies (since that's for Spot, which isn't humanoid sized, so carrying around a heavy battery is more significant)
- I also object that the AGI has little reason to bother with normal human warfare, unless it really makes itself obvious.
- It has little reason to keep large swaths of land. (It could protect some factory, but unless it's getting supplies, that's a problem)
- It has incentive to just disappear as best as possible, or just shrug and release a plague since humanity's risk just went up
- Again, a thirty-year prediction.
- I've already argued against it even needing to bother with thirty years, and I don't think that it needs a typical conception of a robot army in most cases
- I think this claim of 'thirty years' for this thing depends (beyond the other bits) on how much we've automated various parts of the system before then. We have a trend towards it, and our AIs are getting better at tasks like that, so I don't think it's unlikely. Though I also think it's reasonable to expect we'll settle somewhere before almost full automation.
(Section 8): Mere mortals can't comprehend AGI
- While there is the mildly fun idea of the AGI discovering the one unique trick that immediately makes it effectively a god, I do agree it's unlikely.
- However, I don't think that provides much evidence for your thirty-year timeframe suggestion
- I do think you should be more wary of black swan events, where the AI basically cracks an area of math/problem-solving/socialization-rules/etcetera, but this doesn't play a notable role in my analysis above.
(Section 9): (Not commented upon)
General:
- I think the 'take a while to use human manufacturing' is a possible scenario, but I think relative to shorter methods of neutralization (ex: nanotech) it ranks low.
- (Minor note: It probably ranks higher in probability than nanotech, but that's because nanotech is so specific relative to 'uses human manufacturing for a while', but I don't think it ranks higher than a bunch of ways to neutralize humanity that take < 3 years)
- Overall, I think the article makes some good points in a few places, but I also think it is not doing great epistemically in terms of considering what those you disagree with believe or might believe and in terms of your certainty.
- Just to preface: Eliezer's article has this issue, but it is a list/introducing-generator-of-thoughts, more for bringing unsaid ideas explicitly into words as well as for reference. Your article is an explainer of the reasons why you think he's wrong about a specific issue.
(If there's odd grammar/spelling, then that's primarily because I wrote this while feeling sleepy and then continued for several more hours)
While human moral values are subjective, a sufficiently large amount of them is shared that you can target aligning an AI to that shared part. As well, values held by a majority (ex: caring for other humans, enjoying certain fun things) are also essentially shared. Values that are held by smaller groups can also be catered to.
If humans were sampled from the entire space of possible values, then yes we (maybe) couldn't build an AI aligned to humanity, but we only take up a relatively small space and have a lot of shared values.
The AI problem is easier in some ways (and significantly harder in others) because we're not taking an existing system and trying to align it. We want to design the system (and/or systems that produce that system, aka optimization) to be aligned in the first place. This can be done through formal work to provide guarantees, lots of code, and lots of testing.
However, doing that for some arbitrary agent or even just a human isn't really a focus of most alignment research. A human has the issue that they're already misaligned (in a sense), and there are various technological/ethical/social issues with either retraining them or performing the modifications to get them aligned. If the ideas that people had for alignment were about 'converting' a misaligned intelligence to an aligned one, then humans could maybe be a test-case, but that isn't really the focus. We also are only 'slowly' advancing our ability to understand the body and how the brain works. While we have some of the same issues with neural networks, working with them is a lot cheaper and less unethical, we can rerun them (for non-dangerous networks), etcetera.
Though, there has been talk of things like incentives, moral mazes, inadequate equilibria and more which are somewhat related to the alignment/misalignment of humans and where they can do better.