Yes, certain places like preschools might benefit even from an isolated install.
But that is kind of exceptional.
The world isn't an efficient market, especially because people are kind of set in their ways and like to stick to the defaults unless there is strong social pressure to change.
Far-UVC probably would have a large effect if a particular city or country installed it.
But if only a few buildings install it, then it has no effect because people just catch the bugs elsewhere.
Imagine the effect of treating the sewage from just one house while the untreated sewage from a million other houses still flows into the river. There would be essentially no effect.
ok so from the looks of that it basically just went along with a fantasy he already had. But this is an interesting case and an example of the kind of thing I am looking for.
ok, but this is sort of circular reasoning because the only reason people freaked out is that they were worried about AI risk.
I am asking for a concrete bad outcome in the real world caused by a lack of RLHF-based ethics alignment, which isn't just people getting worried about AI risk.
alignment has always been about doing what the user/operator wants
Well it has often been about not doing what the user wants, actually.
giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her.
Influence over preferences of a single entity is much more conflict-y.
Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.
The point of ELYSIUM is that people get control over non-overlapping places. There are some difficulties where people have preferences over the whole universe. But the real world shows us that those are a smaller thing than the direct, local preference to have your own volcano lair all to yourself.
catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer
And the problem with saying "OK, let's just ban the creation of catgirls" is that then maybe Reedspacer builds a volcano lair just for himself and plays video games in it. The catgirls whose existence you prevented would scream bloody murder, because you took away from them a very good existence that they would have enjoyed, and you also made Reedspacer sad.
The question of what BPA wants to do to Steve, seems to me to be far more important for Steve's safety, than the question of what set of rules will constrain the actions of BPA.
BPA shouldn't be allowed to want anything for Steve. There shouldn't be a term in its world-model for Steve. This is the goal of cosmic blocking. The BPA can't even know that Steve exists.
I think the difficult part is when BPA looks at Bob's preferences (excluding, of course, references to most specific people) and sees preferences for inflicting harm on people-in-general that can be bent just enough to fit into the "not-torture" bucket, and so it synthetically generates some new people and starts inflicting some kind of marginal harm on them.
And I think that this will in fact be a binding constraint on utopia, because most humans will (given the resources) want to make a personal utopia full of other humans that forms a status hierarchy with them at the top. And 'being forced to participate in a status hierarchy that you are not at the top of' is a type of 'generalized consensual harm'.
Even the good old Reedspacer's Lower Bound fits this model. Reedspacer wants a volcano lair full of catgirls, but the catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer. Arguably, that is a form of consensual harm to the catgirls.
So I don't think there is a neat boundary here. The neatest boundary is informed consent, perhaps backed up by some lower-level tests about what proportion of an entity's existence is actually miserable.
If Reedspacer beats his catgirls, makes them feel sad all the time, that matters. But maybe if one of them feels a little bit sad for a short moment that is acceptable.
Steve will never become aware of what Bob is doing to OldSteve
But how would Bob know that he wanted to create OldSteve, if Steve has been deleted from his memory via a cosmic block?
I suppose Bob could perhaps create OldEve. Eve is at a similar but not identical point in personality space to Steve, and the desire to harm people who are like Eve is really the same desire as the desire to harm people like Steve. So Bob's Extrapolated Volition could create OldEve, who somehow consents to being mistreated in a way that doesn't trigger your torture detection test.
This kind of 'marginal case of consensual torture' has popped up in other similar discussions. E.g. In Yvain's (Scott Alexander's) article on Archipelago there's this section:
"""A child who is abused may be too young to know that escape is an option, or may be brainwashed into thinking they are evil, or guilted into believing they are betraying their families to opt out. And although there is no perfect, elegant solution here, the practical solution is that UniGov enforces some pretty strict laws on child-rearing, and every child, no matter what other education they receive, also has to receive a class taught by a UniGov representative in which they learn about the other communities in the Archipelago, receive a basic non-brainwashed view of the world, and are given directions to their nearest UniGov representative who they can give their opt-out request to"""
So Scott Alexander's solution to OldSteve is that OldSteve must get a non-brainwashed education about how ELYSIUM/Archipelago works and be given the option to opt out.
I think the issue here is that "people who unwisely consent to torture even after being told about it" and "people who are willing and consenting submissives" is not actually a hard boundary.
a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority.
If there is an agent that controls 55% of the resources in the universe and is prepared to use 90% of that 55% to kill/destroy everyone else, then, assuming that ELYSIUM forbids them to do that, their rational move is to use their resources to prevent ELYSIUM from being built.
And since they control 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone who was trying to actually create ELYSIUM, they would likely succeed and ELYSIUM wouldn't happen.
Re:threats, see my other comment.
Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people who prefer death to compliance. Doing this costs them basically nothing.
This assumes that threats are allowed. If you allow threats within your system you are losing out on most of the value of trying to create an artificial utopia because you will recreate most of the bad dynamics of real history which ultimately revolve around threats of force in order to acquire resources. So, the ability to prevent entities from issuing threats that they then do not follow through on is crucial.
Improving the equilibria of a game is often about removing strategic options; in this case the goal is to remove the option of running what is essentially organized crime.
In the real world there are various mechanisms that prevent organized crime and protection rackets. If you threaten to use force on someone in exchange for resources, the mere threat of force is itself illegal in most countries and is punished by a loss of resources far greater than anything the threat could have won.
People can still engage in various forms of protest that are mutually destructive of resources (AKA civil disobedience).
The ability to have civil disobedience without protection rackets does seem kind of crucial.
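As a rough formalization of that penalty rule (the notation and the enforcement probability $q$ are mine, not from the comment above): let $G$ be the resources a threat could extract, $P$ the penalty for issuing it, and $q$ the probability the threat is detected and punished. Issuing the threat then has negative expected value whenever
\[
q \cdot P > G,
\]
so a penalty "far greater than the threat could win" deters a rational agent even under imperfect enforcement ($q < 1$).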
his AI girlfriend told him to
Which AI told him this? What exactly did it say? Had it undergone RLHF for ethics/harmlessness?
This is not to do with ethics though?
Air Canada Has to Honor a Refund Policy Its Chatbot Made Up
This is just the model hallucinating?
prevention of another Sydney.
But concretely, what bad outcomes eventuated because of Sydney?
Why would less RL on Ethics reduce productivity? Most work-use of AI has nothing to do with ethics.
In fact since RLHF decreases model capability AFAIK, would skipping this actually increase productivity because the models would be better?
One principled way to do it would be simulated war on narrow issues.
So if actor A spends resources R_c on computation C, any other actor B can surrender resources equal to R_c to prevent computation C from happening. The surrendered resources and the original resources are then physically destroyed (e.g. spent on Bitcoin mining or something).
This then at least means that to a first approximation, no actor has an incentive to destroy ELYSIUM itself in order to stop some computation inside it from happening, because they could just use their resources to stop the computation in the simulation instead. And many actors benefit from ELYSIUM, so there's a large incentive to protect it.
And since the interaction is negative sum (both parties lose resources from their personal utopias) there would be strong reasons to negotiate.
In addition to this there could be rule-based and AI-based protections to prevent unauthorized funny tricks with simulations. One rule could be a sort of "cosmic block" where you can just block some or all other Utopias from knowing about you outside of a specified set of tests ("is torture happening here", etc).
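A toy sketch of the surrender-and-burn rule described above (a minimal model; the class names and numbers are illustrative, not part of any actual design):

```python
# Toy model of the "simulated war" veto rule sketched above.
# Actor A spends r_c on computation C; any actor B may surrender
# an equal amount to cancel C, and both amounts are destroyed.

from dataclasses import dataclass

@dataclass
class Actor:
    name: str
    resources: float

def run_or_veto(a: Actor, b: Actor, r_c: float, b_vetoes: bool) -> bool:
    """Return True if the computation runs, False if it was vetoed."""
    if a.resources < r_c:
        raise ValueError("A cannot afford the computation")
    a.resources -= r_c              # A commits r_c either way
    if b_vetoes and b.resources >= r_c:
        b.resources -= r_c          # B surrenders an equal amount
        # both stakes are destroyed (burned); nobody gains them
        return False
    return True

# Example run: the interaction is negative-sum for both parties.
alice, bob = Actor("A", 100.0), Actor("B", 100.0)
ran = run_or_veto(alice, bob, r_c=30.0, b_vetoes=True)
print(ran, alice.resources, bob.resources)   # False 70.0 70.0
```

Both sides end up strictly poorer whenever the veto is actually used, which is what makes negotiating instead the dominant strategy.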
But the text that you link to does not suggest any mechanism, that would actually protect Steve
There is a baseline set of rules that exists for exactly this purpose, which I didn't want to go into detail on in that piece because it's extremely distracting from the main point. These rules are not necessarily made purely by humans, but could for example be the result of some kind of AI-assisted negotiation that happens at ELYSIUM Setup.
"There would also be certain baseline rules like “no unwanted torture, even if the torturer enjoys it”, and rules to prevent the use of personal utopias as weapons."
But I think you're correct that the system that implements anti-weaponization and the systems that implement extrapolated volitions are potentially pushing against each other. This is of course a tension that is present in human society as well, which is why we have police.
So basically the question is "how do you balance the power of generalized-police against the power of generalized-self-interest."
Now the whole point of having "Separate Individualized Utopias" is to reduce the need for police. In the real world, it does seem to be the case that extremely geographically isolated people don't need much in the way of police involvement. Most human conflicts are conflicts of proximity, crimes of opportunity, etc. It is rare that someone basically starts an intercontinental stalking vendetta against another person. And if you had the entire resources of police departments just dedicated to preventing that kind of crime, and they also had mind-reading tech for everyone, I don't think it would be a problem.
I think the more likely problem is that people will want to start haggling over what kind of universal rights they have over other people's utopias. Again, we see this in real life. E.g. "diverse" characters forced into every video game because a few people with a lot of leverage want to affect the entire universe.
So right now I don't have a fully satisfactory answer to how to fix this. It's clear to me that most human conflict can be transformed into a much easier negotiation over basically who gets how much money/general-purpose-resources. But the remaining parts could get messy.
This seems to only be a problem if the individual advocates have vastly more optimization power than the AIs that check for non-aggression. I don't think there's any reason for that to be the case.
In contemporary society we generally have the opposite problem (the state uses lawfare against individuals).
virtual is strictly better. No one wants his utopia constrained by the laws of physics
Well. Maybe.
Technically it doesn't matter whether Vladimir Putin is good or bad.
What matters is that he is small and weak, and yet he still controls the whole of Russia which is large and powerful and much more intelligent than him.
Yes, I think this objection captures something important.
I have proven that aligned AI must exist and also that it must be practically implementable.
But some kind of failure, i.e. a "near miss" on achieving a desired goal, can happen even if success was possible.
I will address these near misses in future posts.
This objection doesn't affect my argument because I am arguing that an aligned, controllable team of AIs exists, not that every team of AIs is aligned and controllable.
If IQ 500 is a problem, then make them the same IQ as people in Russia, who are, as a matter of fact, controlled by Vladimir Putin, and who cannot and do not spontaneously come up with inscrutable steganography.
I don't see how anyone could possibly argue with my definitions.
mathematical abstraction of an actual real-world ASI
But it's not that: it's a mathematical abstraction of a disembodied ASI that lacks any physical footprint.
The problem with this is that people use the word "superintelligence" without a precise definition. Clearly they mean some computational process. But nobody who uses the term colloquially defines it.
So, I will make the assertion that if a computational process achieves the best possible outcome for you, it is a superintelligence. I don't think anyone would disagree with that.
If you do, please state what other properties you think a "superintelligence" must have, other than being a computational process that achieves the best possible outcome.
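One way to write down the assertion being made here (my formalization, offered only as a restatement of the claim above):
\[
\mathrm{outcome}(C) \in \arg\max_{s \in S} U_H(s) \;\Longrightarrow\; C \text{ is a superintelligence for } H,
\]
where $S$ is the set of realizable states and $U_H$ represents $H$'s preference ordering over them. Any further properties one wants "superintelligence" to have would be extra conditions on top of this.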
I never said it had to be implemented by a state. That is not the claim: the claim is merely that such a function exists.
you can have an alignment problem without humans. E.g. two strawberries problem.
Decoherence means that different branches don't interfere with each other on macroscopic scales. That's just the way it works.
Superfluids/superconductors/lasers are still microscopic effects that only matter at the scale of atoms or at ultra-low temperature or both.
bringing QM into this is not helping. All these types of questions are completely generic QM questions, and ultimately they come down to the |Ψ|² measure (the Born rule).
humans have incoherent preferences
This isn't really a problem with alignment so there's no need to address it here. Alignment means the transmission of a preference ordering to an action sequence. Lacking a coherent preference ordering for states of the universe (or histories, for that matter) is not an alignment problem.
This line of reasoning is interesting and I think it deserves some empirical exploration, which could be done with modern LLMs and agents.
E.g. make a complicated process that generates a distribution of agents via RLHF on a variety of base models and a variety of RLHF datasets, and then test all those agents on some simple tests. Pick the best agents according to their mean scores on those simple tests, and then vet them with some much more thorough tests.
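A rough sketch of what that experiment loop might look like (everything here is a placeholder: the model and dataset names, the scoring functions, and the thresholds are all made up for illustration):

```python
# Sketch of the selection-then-vetting experiment described above.
# All model/dataset names and scoring functions are stand-ins.
import random

BASE_MODELS = ["base_a", "base_b", "base_c"]      # hypothetical base models
RLHF_DATASETS = ["rlhf_1", "rlhf_2", "rlhf_3"]    # hypothetical RLHF datasets

def train_agent(base: str, dataset: str) -> dict:
    """Stand-in for actually running RLHF on a base model."""
    return {"base": base, "dataset": dataset}

def simple_score(agent: dict) -> float:
    """Stand-in for the cheap screening tests."""
    return random.random()

def thorough_score(agent: dict) -> float:
    """Stand-in for the much more expensive vetting tests."""
    return random.random()

# 1. Generate a distribution of agents.
agents = [train_agent(b, d) for b in BASE_MODELS for d in RLHF_DATASETS]

# 2. Screen with cheap tests and keep the top few by mean score.
N_SCREEN_RUNS, TOP_K = 5, 3
screened = sorted(
    agents,
    key=lambda a: sum(simple_score(a) for _ in range(N_SCREEN_RUNS)) / N_SCREEN_RUNS,
    reverse=True,
)[:TOP_K]

# 3. Vet the survivors with the thorough tests.
vetted = [(a, thorough_score(a)) for a in screened]
print(vetted)
```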
ok that's a fair point, I'll take a look but I am still skeptical about being able to do this in practice because in practice the universe is messy.
e.g. if you're looking for an optimal practical babysitter and you really do start a search over all possible combinations of matter that fit inside a 2x2x2 cube and start futzing with the results of that search I think it will go wrong.
But if you adopt some constructive approach with some empirically grounded heuristics I expect it will work much better. E.g. start with a human. Exclude all males (sorry bros!). Exclude based on certain other demographics which I will not mention on LW. Exclude based on nationality. Do interviews. Do drug tests. Etc.
Your set of states of a 2x2x2 cube of matter will contain all kinds of things that are bad in ways you don't understand.
regrettably I have forgotten (or never knew) the proof but it is on Wikipedia
https://en.wikipedia.org/wiki/Log-normal_distribution
I suspect that it is some fairly low-grade integral/substitution trick
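For illustration, if the identity in question is e.g. the log-normal mean (a guess; it isn't specified which result is meant), the substitution trick looks like this:
\[
\mathbb{E}[X] \;=\; \int_0^{\infty} x\,\frac{1}{x\sigma\sqrt{2\pi}}\,e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}\,dx
\;\overset{y=\ln x}{=}\;
\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\,e^{\,y-\frac{(y-\mu)^2}{2\sigma^2}}\,dy
\;=\; e^{\mu+\sigma^2/2},
\]
where the last step is just completing the square in the exponent.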
unspecialized AGI is going to be worse than specialized human.
when did I say that?
No human has a job as scribe
Yes, correct. But people have jobs as copywriters, secretaries, etc. People specialize, because that is the optimal way to get stuff done.
For an example of the crushing advantage of specialization, see this tweet about how a tiny LLM with specialized training for multiplication of large numbers is better at it than cutting-edge general purpose LLMs.
The gap between (3) and (2) is the advantage of specialization. Problem-solving is not a linear scale of goodness, it's an expanding cone where advances in some directions are irrelevant to other directions.
The gap between (1) and (2) - the difference between being best at getting power and actually having the most power - is the advantage of the incumbent. Powerful incumbents can be highly suboptimal and still win because of things like network effects, agglomerative effects, defender's advantage and so on.
There is also another gap here. It's the gap between making entities that are generically obedient, and making a power-structure that produces good outcomes. What is that gap? Well, entities can be generically obedient but still end up producing bad outcomes because of:
(a) coordination problems (see World War I)
(b) information problems (see things like the promotion of lobotomies or HRT for middle-aged women)
(c) political economy problems (see things like NIMBYism, banning plastic straws, TurboTax corruption)
Problems of type (a) happen when everyone wants a good outcome, but they can't coordinate on it and defection strategies are dominant, so people get the bad Nash equilibrium (see the toy payoff matrix below).
Problems of type (b) happen when everyone obediently walks off a cliff together. Supporting things like HRT for middle-aged women, or drinking a glass of red wine per week, was backed by science, but the science was actually bunk. People like to copy each other, and obedience makes this worse because dissenters are punished more. They're being disobedient, you see!
Problems of type (c) happen because a small group of people actually benefit from making the world worse, and it often turns out that that small group are the ones who get to decide whether to perpetuate that particular way of making the world worse!
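Here is a toy payoff matrix for problem (a), just to make the "defection is dominant" point concrete (the payoff numbers are arbitrary):

```python
# Toy illustration of problem (a): a symmetric 2-player game where
# defection is dominant, so the unique Nash equilibrium is worse for
# everyone than mutual cooperation. Payoffs are illustrative only.
PAYOFF = {  # (my_move, their_move) -> my_payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 4, ("D", "D"): 1,
}

# Defection strictly dominates cooperation...
assert PAYOFF[("D", "C")] > PAYOFF[("C", "C")]
assert PAYOFF[("D", "D")] > PAYOFF[("C", "D")]
# ...so (D, D) is the equilibrium, yet both players prefer (C, C).
assert PAYOFF[("C", "C")] > PAYOFF[("D", "D")]
```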
it doesn’t make sense to talk about “aligning superintelligence”, but rather about “aligning civilization” (or some other entity which has the ability to control outcomes)
The key insight here is that
(1) "Entities which do in fact control outcomes"
and
(2) "Entities which are near-optimal at solving the specific problem of grabbing power and wielding it"
and
(3) "Entities which are good at correctly solving a broad range of information processing/optimization problems"
are three distinct sets of entities which the Yudkowsky/Bostrom/Russell paradigm of AI risk has smooshed into one ("The Godlike AI will be (3) so therefore it will be (2) so therefore it will be (1)!"). But reality may simply not work like that and if you look at the real world, (1), (2) and (3) are all distinct sets.
The Contrarian 'AI Alignment' Agenda
Overall Thesis: technical alignment is generally irrelevant to outcomes, but almost everyone in the AI Alignment field is stuck on the opposite (incorrect) assumption, working on technical alignment of LLMs
(1) aligned superintelligence is provably logically realizable [already proved]
(2) aligned superintelligence is not just logically but also physically realizable [TBD]
(3) ML interpretability/mechanistic interpretability cannot possibly be logically necessary for aligned superintelligence [TBD]
(4) ML interpretability/mechanistic interpretability cannot possibly be logically sufficient for aligned superintelligence [TBD]
(5) given certain minimal intelligence, minimal emulation ability of humans by AI (e.g. understands common-sense morality and cause and effect) and of AI by humans (humans can do multiplications etc) the internal details of AI models cannot possibly make a difference to the set of realizable good outcomes, though they can make a difference to the ease/efficiency of realizing them [TBD]
(6) given near-perfect or perfect technical alignment (=AI will do what the creators ask of it with correct intent) awful outcomes are Nash Equilibrium for rational agents [TBD]
(7) small or even large alignment deviations make no fundamental difference to outcomes - the boundary between good/bad is determined by game theory, mechanism design and initial conditions, and only by a satisficing condition on alignment fidelity which is below the level of alignment of current humans (and AIs) [TBD]
(8) There is no such thing as superintelligence anyway because intelligence factors into many specific expert systems rather than one all-encompassing general purpose thinker. No human has a job as a “thinker” - we are all quite specialized. Thus, it doesn’t make sense to talk about “aligning superintelligence”, but rather about “aligning civilization” (or some other entity which has the ability to control outcomes) [TBD]
one can't predict the branch one will end up in
yes one can - all of them!
we would not be able to observe effects of double-slit experiments
yes, but thanks to decoherence this generally doesn't affect macroscopic variables. Branches are causally independent once they have split.
It obviously considers only a single branch.
Thanks to decoherence, you can just ignore any type of interference and treat each branch as a single classical universe.
if it isn't logically impossible
Until I wrote this proof, it was a live possibility that aligned superintelligence is in fact logically impossible.
I'm doubtful that such an assumption holds
Why?
Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons
All cardinalities here are finite. The set of generically realizable states is a finite set because they each have a finite and bounded information content description (a list of instructions to realize that state, which is not greater in bits than the number of neurons in all the human brains on Earth).
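Spelled out (my notation, just restating the counting argument): if every generically realizable state has a description of at most $N$ bits, then
\[
|S| \;\le\; \sum_{k=0}^{N} 2^{k} \;=\; 2^{N+1}-1 \;<\; \infty,
\]
so the set is finite and no cardinality obstruction arises for functions defined on it.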
How are you concluding that such a lookup table constitutes a superintelligence?
Isn't it enough that it achieves the best possible outcome? What other criteria do you want a "superintelligence" to have?
I'm not particularly sold on the idea of launching a powerful argmax search and then doing a bit of handwaving to fix it.
It's like if you wanted a childminder to look after your young child, so you set off an argmax search over a function that looks like (quality) / (cost), and then afterwards tried to sort out whether your results are somehow broken/Goodharted.
If your argmax search is over 20 local childminders then that's probably fine.
But if it's an argmax search over all possible states of matter occupying an 8 cubic meter volume then... uh yeah that's really dangerous.
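A sketch of that contrast (the candidate list and scoring function are made up for illustration):

```python
# Contrast between a bounded argmax over vetted candidates and an
# unbounded search over "all possible states of matter".
# The candidate list and scoring function are illustrative only.

def quality_per_cost(candidate: dict) -> float:
    return candidate["quality"] / candidate["cost"]

# Fine: argmax over a small, pre-vetted pool of local childminders.
local_childminders = [
    {"name": f"cm_{i}", "quality": q, "cost": c}
    for i, (q, c) in enumerate([(8, 20), (7, 15), (9, 30)])
]
best = max(local_childminders, key=quality_per_cost)
print(best["name"])

# Dangerous: the analogous unbounded search would be something like
#   max(all_states_of_matter_in_8_cubic_meters(), key=quality_per_cost)
# where the search space is astronomically large and full of
# adversarial optima the metric does not capture (Goodhart).
```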
argmax of steps 2)-7) over the set of all generically realizable states
Argmax search is dangerous. If you want something "constructive" I think you probably want to more carefully model the selection process.
This is not a problem for my argument. I am merely showing that any state reachable by humans, must also be reachable by AIs. It is fine if AIs can reach more states.