I had a similar thought about "A is B" vs "B is A", but "A is the B" should reverse to "The B is A" and vice versa when the context is held constant and nothing changes the fact, because "is" implies the present condition and "the" implies uniqueness. However, the model might be trained on old and no-longer-correct writing, or writing that includes quotes about past states of affairs. Some context might still be missing, too, e.g. for "A is the president": president of what? It would still be a correct inference to say "The president is A" in the same context at least, and in some others, but not in all.
Also, the present condition can change quickly, e.g. "The time is 5:21:31 pm EST" and "5:21:31 pm EST is the time" quickly become false, but I think these are rare exceptions in our use of language.
pp. 37-38 of Goodsell, 2023 give a better proposal: clip/truncate the utilities into the range [−t,t] and compare the expected clipped utilities in the limit as t→∞. This will still suffer from St. Petersburg lottery problems, though.
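A quick sketch of the clipped-utility comparison (my own illustration, not Goodsell's construction; the truncated lottery and the constant payoff are made up), showing how a St. Petersburg-style lottery eventually beats any fixed payoff as t grows:

```python
# Sketch: compare lotteries by expected utility after clipping payoffs
# to [-t, t], for growing t.

def expected_clipped_utility(lottery, t):
    """lottery: list of (probability, utility) pairs."""
    return sum(p * max(-t, min(t, u)) for p, u in lottery)

# A St. Petersburg lottery: pays 2^k with probability 2^-k
# (truncated at k=200 so it's finite; probabilities sum to ~1).
st_petersburg = [(2.0 ** -k, 2.0 ** k) for k in range(1, 201)]
constant_100 = [(1.0, 100.0)]

# The clipped expectation of the St. Petersburg lottery grows roughly like
# log2(t), so for large enough t it beats any fixed constant payoff.
for t in (10.0, 1e4, 1e40):
    print(t, expected_clipped_utility(st_petersburg, t),
          expected_clipped_utility(constant_100, t))
```

For small t the constant wins, but the ranking flips once t is large enough, which is the sense in which the limit comparison still prefers St. Petersburg-like lotteries to any constant.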
Looking at Gustafsson, 2022's money pumps for completeness, the precaution principles he uses just seem pretty unintuitive to me. The idea seems to be that if you'll later face a decision situation where you can make a choice that makes you worse off but you can't make yourself better off by getting there, you should avoid the decision situation, even if it's entirely under your control to make a choice in that situation that won't leave you worse off. But, you can just make that choice that won't leave you worse off later instead of avoiding the situation altogether.
Here's the forcing money pump:
It seems obvious to me that you can just stick with A all the way through, or switch to B, and neither would violate any of your preferences or be worse than any other option. Gustafsson is saying that would be irrational, seemingly because there's some risk you'll make the wrong choices. Another kind of response, like your policy, I can imagine is that unless you have preferences otherwise (i.e. would strictly prefer another accessible option to what you have now), you just stick with the status quo as the default. This means sticking with A all the way through, because you're never offered a strictly better option than it.
Another problem with the precaution principles is that they seem much less plausible when you seriously entertain incompleteness, rather than kind of treat incompleteness like equivalence. He effectively argues that at node 3, you should pick B, because otherwise at node 4, you could end up picking B-, which is worse than B, and there's no upside. But that basically means claiming that one of the following must hold:
1. you'll definitely pick B- at 4, or
2. B is better than any strict probabilistic mixture of A and B-.
But both are false in general. 1 is false in general because A is permissible at 4. 2 is false in general because A and B are incomparable and incomparability can be infectious (e.g. MacAskill, 2013), so B can be incomparable with a strict probabilistic mixture of A and B-. It also just seems unintuitive, because the claim is made generally, and so would have to hold no matter how low the probability assigned to B- is, as long as it's positive.
Imagine A is an apple, B is a banana and B- is a slightly worse banana, and I have no preferences between apples and bananas. It would be odd to say that a banana is better than an apple or a tiny probability of a worse banana. This would be like using the tiny risk of a worse banana with the apple to break a tie between the apple and the banana, but there's no tie to break, because apples and bananas are incomparable.
If A and B were equivalent, then B would indeed very plausibly be better than a strict probabilistic mixture of A and B-. This would follow from Independence, or if A, B and B- are deterministic outcomes, statewise dominance. So, I suspect the intuitions supporting the precaution principles are accidentally treating incomparability like equivalence.
I think a more useful way to think of incomparability is as indeterminacy about which is better. You could consider what happens if you treat A as (possibly infinitely) better than B in one whole treatment of the tree, what happens if you treat B as better than A in a separate treatment, and what happens if you treat them as equivalent all the way through (extending your preference relation to be transitive and to continue satisfying stochastic dominance and independence in each case). If B were better, you'd end up at B, no money pump. If A were better, you'd end up at A, no money pump. If they were equivalent, you'd end up at either (or maybe specifically B, because of precaution), no money pump.
I think a multi-step decision procedure would be better. Do what your preferences themselves tell you to do and rule out any options you can with them. If there are multiple remaining incomparable options, then apply your original policy to avoid money pumps.
if I previously turned down some option X, I will not choose any option that I strictly disprefer to X
seems irrational to me if applied in general. Suppose I offer you X and Y, where both X and Y are random, and Y is ex ante preferable to X, e.g. stochastically dominates X, but has some chance of being worse than X. You pick Y. Then Y resolves to y. However, suppose you get unlucky, and y is worse than X. Suppose further that there's a souring of X, X−, that's still preferable to y. Then I offer to let you trade y for X−. It seems irrational not to take X−.
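A minimal sketch of the dominance comparison doing the work here (the lotteries are hypothetical; Y is ex ante preferable to X but can still resolve worse):

```python
# Sketch: first-order stochastic dominance between finite lotteries.
# y dominates x if y gives at least as much probability to outcomes >= v
# for every threshold v, and strictly more for some v.

def stochastically_dominates(y, x):
    """y, x: dicts mapping outcome -> probability."""
    strict = False
    for v in sorted(set(y) | set(x)):
        py = sum(p for o, p in y.items() if o >= v)
        px = sum(p for o, p in x.items() if o >= v)
        if py < px:
            return False
        if py > px:
            strict = True
    return strict

# Hypothetical lotteries: Y stochastically dominates X, so Y is ex ante
# preferable, but if they resolve independently, Y can still come up 1
# while X would have come up 10.
X = {0.0: 0.5, 10.0: 0.5}
Y = {1.0: 0.5, 11.0: 0.5}
print(stochastically_dominates(Y, X))  # True
```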
Maybe what you need to do is first evaluate according to your multi-utility function (or stochastic dominance, which I think is a special case) to rule out some options, i.e. to rule out not trading y for X− when the latter is better than the former, and then apply your policy to rule out more options.
Also, the estimate of the current number of researchers probably underestimates the number of people (or person-hours) who will work on AI safety. You should probably expect further growth to the number of people working on AI safety, because the topic is getting mainstream coverage and support, Hinton and Bengio have become advocates, and it's being pushed more in EA (funding, community building, career advice).
However, the FTX collapse is reason to believe there will be less funding going forward.
Some other possibilities that may be worth considering and can further reduce impact, at least for an individual looking to work on AI safety themself:
Some work is net negative and increases the risk of doom or wastes the time and attention of people who could be doing more productive things.
Practical limits on the number of people working at a time, e.g. funding and management/supervision capacity. This could mean some people have a much lower probability of making a difference, if their taking a position pushes out someone else who would otherwise have worked in the field, or pushes them into (possibly much) less useful work.
An AGI could give read and copy access to the code being run and the weights directly on the devices from which the AGI is communicating. That could still be a modified copy of the original and more powerful (or with many unmodified copies) AGI, though. So, the other side may need to track all of the copies, maybe even offline ones that would go online on some trigger or at some date.
Also, giving read and copy access could be dangerous to the AGI if it doesn't have copies elsewhere.
My understanding from Eliezer's writing is that he's an illusionist (and/or a higher-order theorist) about consciousness. However, illusionism (and higher-order theories) are compatible with mammals and birds, at least, being conscious. It depends on the specifics.
If I recall correctly, Eliezer seemed to give substantial weight to relatively sophisticated self- and other-modelling, like cognitive empathy and passing the mirror test. Few animals seem to pass the mirror test, so that would be reason for skepticism.
However, maybe they’re just not smart enough to infer that the reflection is theirs, or they don’t rely enough on sight. Or, they may recognize themselves in other ways or to at least limited degrees. Dogs can remember what actions they’ve spontaneously taken (Fugazza et al., 2020) and recognize their own bodies as obstacles (Lenkei, 2021), and grey wolves show signs of self-recognition via a scent mirror test (Cazzolla Gatti et al., 2021, layman summary in Mates, 2021). Pigeons can discriminate themselves from conspecifics with mirrors, even if they don’t recognize the reflections as themselves (Wittek et al., 2021, Toda and Watanabe, 2008). Mice are subject to the rubber tail illusion and so probably have a sense of body ownership (Wada et al., 2016).
Furthermore, Carey and Fry (1995) show that pigs generalize the discrimination between non-anxiety states and drug-induced anxiety to non-anxiety and anxiety in general, in this case by pressing one lever repeatedly with anxiety, and alternating between two levers without anxiety (the levers gave food rewards, but only if pressed according to the condition). Similar experiments were performed on rodents, as discussed in Sánchez-Suárez, 2016, section 4.d., starting on p. 81. Rats generalized from hangover to morphine withdrawal and jetlag, from high doses of cocaine to movement restriction, and from an anxiety-inducing drug to aggressive defeat and predator cues. Of course, anxiety has physical symptoms, so maybe that's what they're discriminating, not the negative affect.
I think it's worth pointing out that from the POV of such ethical views, non-extinction could be an existential risk relative to extinction, or otherwise not that important (see also the asymmetric views in Thomas, 2022). If we assign some credence to those views, then we might instead focus more of our resources on avoiding harms without also (significantly) increasing extinction risks, perhaps especially reducing s-risks or the torture of sentient beings.
Furthermore, the more we reduce the risks of such harms, the less prone deontological (and other morally asymmetric) AI could be to aim for extinction.
The arguments typically require agents to make decisions independently of the parts of the decision tree in the past (or that are otherwise no longer accessible, in case they were ruled out). But an agent need not do that. An agent can always avoid getting money pumped by just following the policy of never picking an option that completes a money pump (or the policy of never making any trades, say). They can even do this with preference cycles.
Does this mean money pump arguments don't tell us anything? Such a policy may have other costs that an agent would want to avoid if following their preferences locally would otherwise lead to getting money pumped (e.g. as Gustafsson (2022) argues in section 7, Against Resolute Choice), but how important that is could depend on those costs, including how frequently they expect to incur them, as well as the costs of changing their preferences to satisfy rationality axioms. It seems bad to pick options you'll foreseeably regret. However, changing your preferences to fit some proposed rationality requirements also seems foreseeably regrettable in another way: you have to give up things you care about, or some ways you care about them. And that can be worse than your other options for avoiding money pumps, or even, sometimes, than getting money pumped.
Furthermore, agents plausibly sometimes need to make commitments that would bind them in the future, even if they'd like to change their minds later, in order to win in Parfit's hitchhiker, say.
Similarly, if instead of money pumps, an agent should just avoid any lottery that's worse than (or strictly statewise dominated by, or strictly stochastically dominated by, under some suitable generalization) another they could have guaranteed, it's not clear that's a requirement of rationality, either. If I prefer A<B<C<A, then it doesn't seem more regrettable if I pick one option than if I pick another (knowing nothing else), even though no matter what option I pick, it seems regrettable that I didn't pick another. Choosing foreseeably regrettable options seems bad, but if every option is (foreseeably) regrettable in some way, and there's no least of the evils, then is it actually irrational?
Furthermore, if a superintelligence is really good at forecasting, then maybe we should expect it to have substantial knowledge of the decision tree in advance, and to typically be able to steer clear of situations where it might face a money pump or other dilemmas, and if it ever does get money pumped, the costs of all money pumps would be relatively small compared to its gains.
See also EJT's comment here (and the rest of the thread). You'd just pick any one of the utility functions. You can also probably drop continuity for something weaker, as I point out in my reply there.
This is cool. I don't think violations of continuity are in general exploitable either, but I'd guess you should also be able to replace continuity with something weaker from Russell and Isaacs, 2020, just enough to rule out St. Petersburg-like lotteries: specifically, any one of Countable Independence (which can also replace independence), the Extended Outcome Principle (which can also replace independence), or Limitedness, and then replace the real-valued utility functions with utility functions representable by "lexicographically ordered ordinal sequences of bounded real utilities".
I wonder if we can "extend" utility maximization representation theorems to drop Completeness. There's already an extension to drop Continuity by using an ordinal-indexed vector (sequence) of real numbers, with entries sorted lexicographically ("lexicographically ordered ordinal sequences of bounded real utilities", Russell and Isaacs, 2020). If we drop Completeness, maybe we can still represent the order with a vector of independent but incomparable dimensions across which it must respect ex ante Pareto efficiency (and each of those dimensions could also be split into an ordinal-indexed vector of real numbers with entries sorted lexicographically, if we're also dropping Continuity)?
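A finite toy version of the lexicographic idea (my own illustration; the actual, far more general construction is in Russell and Isaacs, 2020), showing how such representations can violate Continuity:

```python
# Toy sketch: utilities as tuples compared lexicographically, with the first
# coordinate (say, lives saved) lexically prior to the second (money).
# Python compares tuples lexicographically by default.

def lex_mix(p, a, b):
    """Coordinate-wise expected utility of the mixture p*a + (1-p)*b."""
    return tuple(p * ai + (1 - p) * bi for ai, bi in zip(a, b))

best, mid, worst = (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)

# Continuity would require some p making the mixture of best and worst
# indifferent to mid. But every mixture with p < 1 is strictly worse than
# mid (its first coordinate falls below 1), and p = 1 is strictly better.
print(all(lex_mix(p, best, worst) < mid for p in (0.1, 0.5, 0.999)))  # True
print(lex_mix(1.0, best, worst) > mid)  # True
```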
These also give us examples of somewhat natural/non-crazy orders that are consistent with dropping Completeness. I've seen people (including some economists) claim interpersonal utility comparisons are impossible and that we should only seek Pareto efficiency across people and not worry about tradeoffs between people. (Said Achmiz already pointed this and other examples out.)
Intuitively, the dimensions don't actually need to be totally independent. For example, the order could be symmetric/anonymous/impartial between some dimensions, i.e. swapping values between these dimensions gives indifference. You could also have some strict preferences over some large tradeoffs between dimensions, but not small tradeoffs. Or even, maybe you want more apples and more oranges without tradeoffs between them, but also prefer more bananas to more apples and more bananas to more oranges. Or, a parent, having to give a gift to one of their children, may strictly prefer randomly choosing over picking one child to give it to, and find each nonrandom option incomparable to one another (although this may have problems when they find out which one they will give to, and then give them the option to rerandomize again; they might never actually choose).
Maybe you could still represent all of this with a large number of, possibly infinitely many, real-valued utility functions (or utility functions representable by "lexicographically ordered ordinal sequences of bounded real utilities") instead. So, the correct representation could still be something like a (possibly infinite) set of utility functions (each possibly a "lexicographically ordered ordinal sequence of bounded real utilities"), across which you must respect ex ante Pareto efficiency. This would be similar to the maximality rule over your representor/credal set/credal committee for imprecise credences (Mogensen, 2019).
Then, just combine this with your policy "if I previously turned down some option X, I will not choose any option that I strictly disprefer to X", where strictly disprefer is understood to mean ex ante Pareto dominated.
But now this seems like a coherence theorem, just with a broader interpretation of "expected utility".
To be clear, I don't know if this "theorem" is true at all.
Possibly also related: McCarthy et al., 2020 have a utilitarian representation theorem that's consistent with "the rejection of all of the expected utility axioms, completeness, continuity, and independence, at both the individual and social levels". However, it's not a real-valued representation. It reduces lotteries over a group of people to a lottery over outcomes for one person, as the probabilistic mixture of each separate person's lottery into one lottery.
Surprisingly little is known about how the general public understands consciousness, yet information on common intuitions is crucial to discussions and theories of consciousness. We asked 202 members of the general public, “In your own words, what is consciousness?” and analyzed the frequencies with which different perspectives on consciousness were represented. Almost all people (89%) described consciousness as fundamentally receptive – possessing, knowing, perceiving, being aware, or experiencing. In contrast, the perspective that consciousness is agentic (actively making decisions, driving output, or controlling behavior) appeared in only 33% of responses. Consciousness as a social phenomenon was represented by 24% of people. Consciousness as being awake or alert was mentioned by 19%. Consciousness as mystical, transcending the physical world, was mentioned by only 10%. Consciousness in relation to memory was mentioned by 6%. Consciousness as an inner voice or inner being – the homunculus view – was represented by 5%. Finally, only three people (1.5%) mentioned a specific, scholarly theory about consciousness, suggesting that we successfully sampled the opinions of the general public rather than capturing an academic construct. We found little difference between men and women, young and old, or US and non-US participants, except for one possible generation shift. Young, non-US participants were more likely to associate consciousness with moral decision-making. These findings show a snapshot of the public understanding of consciousness – a network of associated concepts, represented at varying strengths, such that some are more likely to emerge when people are asked an open-ended question about it.
I think it's more illustrative than anything, and a response to Robert Miles using chess against Magnus Carlsen as an analogy for humans vs AGI. The point is that a large enough material advantage can help someone win against a far smarter opponent. Somewhat more generally, I think arguments for AI risk often put intelligence on a pedestal, without addressing its limitations, including the physical resource disadvantages AGIs will plausibly face.
I agree that the specifics of chess probably aren't that helpful for informing AI risk estimates, and that a better tuned engine could have done better against the author.
Maybe a better experiment to run would be playing real-time strategy games against a far smarter but materially disadvantaged AI, though this would also limit the space of actions the AI could take relative to the real world.
For my 2nd paragraph, I meant that the experiment would underestimate the required resource gap. Being down exactly by a queen at the start of a game is not as bad as being down exactly by a queen later into the game when there are fewer pieces overall left, because that's a larger relative gap in resources.
Would queen-odds games pass through roughly within-distribution game states, anyway, though?
Or, either way, if/when it does reach roughly within-distribution game states, the material advantage in relative terms will be much greater than just being down a queen early on, so the starting material advantage would still underestimate the real material advantage for a better trained AI.
I see a few comments here on fortified foods. I think the vitamin D and iron are usually in less bioavailable forms (D2 and non-heme iron) in fortified plant-based foods than in animal products, and I don't know if the % daily value on labels accounts for that. I take a multivitamin with apparently 100% of the daily value for both, and it wasn't enough based on my bloodwork and lightheadedness/blackouts (few of the other vegans I know got lightheaded; I had been giving blood every 2 months, but even after I stopped, the multivitamin alone didn't seem to be enough), so I take separate supplements for each on top of the multivitamin. I was also eating nuts, legumes, and kale or spinach for iron most days.
Looking at my soy milk (Silk), per cup, the label says it has only 2 micrograms of vitamin D (D2), or 10% of the daily value. Some of my other vegan dairy products have no vitamin D. There's no vitamin D in Beyond Burgers, but there is a decent amount of (non-heme) iron per patty (5.5 mg, 31% DV). Similarly for my Yves veggie chicken. I'd be surprised if most vegans not taking supplements with vitamin D even reach 100% DV according to the labels (whether they account for bioavailability or not) through diet alone.
I live in Canada, though, so vitamin D is harder to get from the sun during the winter.
Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.
I was already thinking the AI would be risk averse over whole policies and the aggregate value of their future, not locally/greedily/separately for individual actions and individual unaggregated rewards.
I agree model-free RL wouldn't necessarily inherit the risk aversion, although I'd guess there's still a decent chance it would, because that seems like the most natural and simple way to generalize the structure of the rewards.
Why would hardcoded model-based RL probably self-modify or build successors this way, though? To deter/prevent threats from being made in the first place or even followed through on? But, does this actually deter or prevent our threats when evaluating the plan ahead of time, with the original preferences? We'd still want to shut it and any successors down if we found out (whenever we do find out, or it starts trying to take over), and it should be averse to that increased risk ahead of time when evaluating the plan.
Risk aversion wouldn't help humanity much if we build unaligned AGI anyhow. The least risky plans from the AI's perspective are still gonna be bad for humans.
I think there are (at least) two ways to reduce this risk:
1. Temporal discounting. The AI wants to ensure its own longevity, but is really focused on the very near term, just making it through the next day or hour, or whatever, so increasing the risk of being caught and shut down now by doing something sneaky looks bad even if it increases expected longevity significantly, because it discounts the future so much. It will be more incentivized to do whatever people appear to want it to do ~now (regardless of impacts on the future), or else risk being shut down sooner.
2. Difference-making risk aversion, i.e. being risk averse with respect to the difference from inaction (or some default safe action). This makes inaction look relatively more attractive. (In this case, I think the agent can't be represented by a single consistent utility function over time, so I wonder if self-modification or successor risks would be higher, to ensure consistency.)
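A toy sketch of the temporal discounting idea (all numbers hypothetical): with steep discounting, a sneaky plan that raises near-term shutdown risk looks worse than compliance even if it greatly increases expected long-run reward.

```python
# Sketch: compare a compliant plan to a sneaky plan under temporal
# discounting, weighting each period's reward by the probability the AI
# hasn't been shut down yet.

def discounted_value(rewards, survival_probs, gamma):
    """Sum of gamma^t * P(still running through step t) * reward_t."""
    total, alive = 0.0, 1.0
    for t, (r, s) in enumerate(zip(rewards, survival_probs)):
        alive *= s
        total += (gamma ** t) * alive * r
    return total

horizon = 100
compliant_r, compliant_s = [1.0] * horizon, [0.99] * horizon
# Sneaky plan: high risk of being caught and shut down early, huge payoff later.
sneaky_r = [0.0] * 10 + [100.0] * (horizon - 10)
sneaky_s = [0.8] * 10 + [0.999] * (horizon - 10)

for gamma in (0.5, 0.99):
    c = discounted_value(compliant_r, compliant_s, gamma)
    s = discounted_value(sneaky_r, sneaky_s, gamma)
    print(gamma, c > s)  # compliance wins at gamma=0.5, loses at gamma=0.99
```

The same plans flip in ranking as gamma rises, which is the sense in which steep discounting, not the plans themselves, does the safety work here.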
I think your scenario only illustrates a problem with outer alignment (picking the right objective function), and I think it's possible to state an objective, that if it could be implemented sufficiently accurately and we could guarantee the AI followed it (inner alignment), would not result in a dystopia like this. If you think the model would do well at inner alignment if we fixed the problems with outer alignment, then it seems like a very promising direction and this would be worth pointing out and emphasizing.
I think the right direction is modelling how humans now (at the time before taking the action), without coercion or manipulation, would judge the future outcome if properly informed about its contents, especially how humans and other moral patients are affected (including what happened along the way, e.g. violations of consent and killing). I don't think you need the coherency of coherent extrapolated volition, because you can still capture people finding this future substantially worse than some possible futures, including the ones where the AI doesn't intervene, by some important lights, and just set a difference-making ambiguity averse objective. Or, maybe require it not to do substantially worse by any important light, if that's feasible: we would allow the model to flexibly represent the lights by which humans judge outcomes where doing badly in one means doing badly overall. Then it would be incentivized to focus on acts that seem robustly better to humans.
I think an AI that actually followed such an objective properly would not, by the lights of whichever humans whose judgements it's predicting, increase the risk of dystopia through its actions (although building it may be s-risky, in case of risks of minimization). Maybe it would cause us to slow moral progress and lock in the status quo, though. If the AI is smart enough, it can understand "how humans now (at the time before taking the action), without coercion or manipulation, would judge the future outcome if properly informed about its contents", but it can still be hard to point the AI at that even if it did understand it.
Another approach I can imagine is to split up the rewards into periods, discount them temporally, check for approval and disapproval signals in each period, and make it very costly relative to anything else to miss one approval or receive a disapproval. I describe this more here and here. As JBlack pointed out in the comments of the second post, there's incentive to hack the signal. However, as long as attempts to do so are risky enough by the lights of the AI and the AI is sufficiently averse to losing approval or getting disapproval and the risk of either is high enough, it wouldn't do it. And of course, there's still the problem of inner alignment; maybe it doesn't even end up caring about approval/disapproval in the way our objective says it should out of distribution.
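Here's a minimal sketch of that kind of objective (my own formalization, not from the linked posts; the penalty size and discount factor are made up):

```python
# Sketch: per-period rewards, temporally discounted, where losing approval
# or receiving disapproval in any period costs far more than any attainable
# reward.

PENALTY = -1e9  # dwarfs every per-period reward

def objective(periods, gamma=0.9):
    """periods: list of (reward, approved) pairs, one per period."""
    return sum((gamma ** t) * (r if approved else PENALTY)
               for t, (r, approved) in enumerate(periods))

honest = [(1.0, True)] * 20
# A signal-hacking attempt that loses approval for even one period looks
# terrible ex ante, as long as the AI thinks failure is likely enough.
hack_fails_once = [(1.0, True)] * 5 + [(1000.0, False)] + [(1000.0, True)] * 14
print(objective(honest) > objective(hack_fails_once))  # True
```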
I would say it seems like it's almost there, but it also seems to me to already have some fluid intelligence, and that might be why it seems close. If it doesn't have fluid intelligence, then my intuition that it's close may not be very reliable.
(Your reply is in response to a comment I deleted, because I thought it was basically a duplicate of this one, but I'd be happy if you'd leave your reply up, so we can continue the conversation.)
See here, starting from "consider a scheme like the following". In short: should be possible, but seems non-trivially difficult.
That seems like a high bar to me for testing for any fluid intelligence, though, and the vast majority of humans would do about as badly or worse (though possibly because of far worse crystallized intelligence). Similarly for your post's "No scientific breakthroughs, no economy-upturning startup pitches, certainly no mind-hacking memes."
I would say to look at it based on definitions and existing tests of fluid intelligence. These are about finding patterns and relationships between unfamiliar objects and any possible rules relating to them, applying those rules and/or inference rules with those identified patterns and relationships, and doing so more or less efficiently. More fluid intelligence means noticing patterns earlier, taking more useful steps and fewer useless steps.
Some ideas for questions:
Invent new games or puzzles, and ask it to achieve certain things from a given state.
Invent new mathematical structures (e.g. new operations on known objects, or new abstract algebraic structures based on their axioms) and ask the LLM to reason about them and prove theorems (that weren't too hard to prove yourself or for someone else to prove).
Ask it to do hardness proofs (like NP-hardness proofs), either between two new problems, or just with one problem (e.g. ChatGPT proved a novel problem was NP-hard here).
Maybe other new discrete math problems.
EDIT: New IMO and Putnam problems.
My impression is that there are few cross-applicable techniques in these areas, and the ones that exist often don't get you very far to solving problems. To do NP-hardness proofs, you need to identify patterns and relationships between two problems. The idea of using "gadgets" is way too general and hides all of the hard work, which is finding the right gadget to use and how to use it. EDIT: For IMO and Putnam problems, there are some common tools, too, but if just simple pattern matching for those was all it took, math undergrads would generally be good at them, and they're not, so it probably does take considerable fluid intelligence.
I guess one possibility is that an LLM can try a huge number of steps and combinations of steps before generating the next token, possibly looking ahead multiple steps internally before picking one. Maybe it could solve hard problems this way without fluid intelligence.
I think many EAs/rationalists shouldn't find this to be worse for humans than life today on the views they apparently endorse, because each human looks better off under standard approaches to intrapersonal aggregation: they get more pleasure, less suffering, more preference satisfaction (or we can imagine some kind of manipulation to achieve this), but at the cost of some important frustrated preferences.
Suppose that some technology requires 10 components to get to work. Over the last decades, you've seen people gradually figure out how to build each of these components, one by one. Now you're looking at the state of the industry, and see that we know how to build 9 of them. Do you feel that the technology is still a long time away, because we've made "zero progress" towards figuring out that last component?
This seems pretty underspecified, so I don't know, but I wouldn't be very confident it's close:
Am I supposed to assume the difficulty of the last component should reflect the difficulty of the previous ones?
I'm guessing you're assuming the pace of building components hasn't been decreasing significantly. I'd probably grant you this, based on my impression of progress in AI, although it could depend on what specific components you have in mind.
What if the last component is actually made up of many components?
I agree with the rest of your comment, but it doesn't really give me much reason to believe it's close, rather than just closer than before/otherwise.
If they have zero fluid intelligence now, couldn't it be that building fluid intelligence is actually very hard and we're probably a long way off, maybe decades? It sounds like we've made almost no progress on this, despite whatever work people have been doing.
There could still be a decent probability of AGI coming soon, and that could be enough to warrant acting urgently (or so could non-AGI, e.g. more task-specific AI used to engineer pathogens).
For starters, suppose the AI straps lots of humans into beds, giving them endless morphine and heroin IV drips, and the humans get into such a state of delirium that they repeatedly praise and thank the AI for continuing to keep the heroin drip turned on.
This dystopian situation would be, to the AI, absolute ecstasy—much like the heroin to those poor humans.
This seems to require some pretty important normative claims that seem controversial in EA and the rationality community. Based on your description, it seems like the humans come to approve of this (desire/prefer this) more than they approved of their lives before (or we could imagine similar scenarios where this is the case), and gain more pleasure from it, and you could have their approval (by assumption) outweigh the violation of the preferences for this not to happen. So, if you're a welfarist consequentialist and a hedonist or a desire/preference theorist, and unless an individual's future preferences count much less, this just seems better for those humans than what normal life has been like lately.
Some ways out seem to be:
1. Maybe certain preference-affecting views, or discounting future preferences, or antifrustrationism (basically negative preference utilitarianism), or something in these directions
2. Counting preferences more or less based on their specific contents, e.g. wanting to take heroin is a preference that counts less in your calculus
3. Non-hedonist and non-preferential/desire-based welfare (possibly in addition to hedonistic and preferential/desire-based welfare), e.g. objective goods/bads
4. Non-welfarist consequentialist values, i.e. valuing outcomes for reasons other than how they matter to individuals' welfare
5. Non-consequentialism, e.g. constraints on violating preferences or consent, or on not getting affirmative consent
6. Actually, whatever preferences they had before and were violated were stronger than the ones they have now and are satisfied
7. Actually, we should think much bigger; maybe we should optimize with artificial consciousness instead and do that a lot more.
If it's something like 1 or 5, it should instead (or also?) model what the humans already want, and try to get that to happen.
I'm not sure this is the most useful way to think about it, either, because it includes the possibility that we didn't solve the Riemann hypothesis first just because we weren't really interested in it, not because of any kind of inherent difficulty to the problem or our suitability to solving it earlier. I think you'd want to consider:
alternative histories where solving the Riemann hypothesis was a (or the) main goal for humanity, and
alternative histories where world takeover was a (or the) main goal for humanity (our own actual history might be close enough)
and ask if we solve the Riemann hypothesis at earlier average times in worlds like 1 than we take over the world in worlds like 2.
We might also be able to imagine species that could take over the world but seem to have no hope of ever solving the Riemann hypothesis, and I think we want to distinguish that from just happening not to solve it first. Depending on what you mean by "taking over the world", other animals may have done so before us, too, e.g. arthropods, or even plants or other forms of life, before or more so than any group of animals, even all animals combined.
Even without ensuring inner alignment, is it possible to reliably train the preferences of an AGI to be more risk-averse, to have bounded utilities and to discount the future more? For example, just by using rewards in RL with those properties, even if the AGI misgeneralizes the objective or the objective is not outer aligned, it might still internalize the intended risk aversion and discounting. How likely is it to do so? Or, can we feasibly hardcode risk aversion, bounded preferences and discounting into today's models without reducing capabilities too much?
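To make concrete what rewards "with those properties" could look like, here's a minimal sketch (all function names and parameter values are hypothetical illustrations, not a claim about how any actual lab trains models, and of course it doesn't settle whether the trained agent internalizes these properties):

```python
import math

def shape_reward(raw_reward, bound=1.0, risk_aversion=1.0):
    # Clip the raw reward into [-bound, bound] so no single payoff can
    # dominate, then apply a concave (exponential-utility) transform so
    # large gains are worth disproportionately less than large losses
    # are costly, a standard way to encode risk aversion.
    r = max(-bound, min(bound, raw_reward))
    return (1.0 - math.exp(-risk_aversion * r)) / risk_aversion

def discounted_return(rewards, gamma=0.9):
    # A smaller gamma makes distant-future payoffs, e.g. from a
    # long-shot takeover attempt, matter much less to the agent.
    return sum(shape_reward(r) * gamma ** t for t, r in enumerate(rewards))
```

Under this shaping, a reward of +1000 contributes no more than a reward of +1, and a guaranteed modest payoff beats a high-variance gamble with the same mean, which is the sense in which takeover attempts become relatively less attractive.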
My guess is that such AGIs would be safer. Given that there are some risks to the AGI that the AGI will be caught and shut down if it tries to take over, takeover attempts should be relatively less attractive. But could it make much difference?
I think the acausal economy would look aggressively space-expansionist/resource-exploitative (those are the agents that will acquire and therefore control the most resources; others will self-select out or be out-competed), and, if you're pessimistic about alignment, it would include some Goodharted human(-like) values from failed alignment (and possibly some bad human-like values). The Goodharting may go disproportionately in directions that are more resource-efficient and allow faster resource acquisition and use, and successful takeover (against their creators and other AI). We may want to cooperate most with those using their resources disproportionately for artificial minds, or for which there's the least opportunity cost to do so (say because they're focusing on building more hardware that could support digital minds).
Here's a related illusionist-compatible evolutionary hypothesis about consciousness: consciousness evolved to give us certain resilient beliefs that are adaptive to have. For example, belief in your own consciousness contributes to the belief that death would be bad, and this belief is used when you reason and plan, especially to avoid death. The badness or undesirability of suffering (or of the things that cause us suffering) is another such resilient belief. In general, we use reason and planning to pursue things we believe are good and prevent things we believe are bad. Many of the things we believe are good or bad have been shaped by evolution to cause us pleasure or suffering, so evolution was able to hijack our capacities for reason and planning to spread genes more.
Then this raises some questions: for what kinds of reasoning and planning would such beliefs actually be useful (over what we would do without them)? Is language necessary? How much? How sophisticated was the language of early Homo sapiens or earlier ancestors, and how much have our brains and cognitive capacities changed since then? Do animals trained to communicate more (chimps, gorillas, parrots, or even cats and dogs with word buttons) meet the bar?
When I think about an animal simulating outcomes (e.g. visualizing or reasoning about them) and deciding how to act based on whichever outcome seemed most desirable, I'm not sure you really need "beliefs" at all. The animal can react emotionally or with desire to the simulation, and then that reaction becomes associated with the option that generated it, so options will end up more or less attractive this way.
Also, somewhat of an aside: some illusions (including optical illusions, magic) are like lies of omission and disappear when you explain what's missing, while others are lies of commission, and don't disappear when you explain them (many optical illusions). Consciousness illusions seem more like the latter: people aren't going to stop believing they're conscious even if they understood how consciousness works. See
I think some nonhuman animals also have some such rich illusions, like the rubber tail illusion in rodents and I think some optical illusions, but it's not clear what this says about their consciousness under illusionism.
Maybe some aversion can be justified by differences in empirical beliefs, and to reduce risks from motivated reasoning, the typical mind fallacy, or paternalism, which can lead to tragedies of the commons, e.g. everyone exploiting one another while mistakenly believing it's in people's overall best interests when it isn't, so people are made worse off overall. And if people are more averse to exploiting or otherwise harming others, they're more trustworthy and cooperation is easier.
But, there are very probably cases where very minor exploitation for very significant benefits (including preventing very significant harms) would be worth it.
I guess this allows that they can still have very different goals, since they ought to be able to coordinate if they have identical utility functions, i.e. they rank outcomes and prospects identically (although I guess there's still a question of whether differences in epistemic states could cause failures to coordinate?). Something like "maximize total hedonistic utility" could be coordinated on if everyone adopted it. But that's of course a much less general case than arbitrary and differing preferences.
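The parenthetical worry can be made concrete with a toy sketch (all names here are hypothetical): two agents sharing one utility function always choose alike given identical credences, but differing credences alone can already break coordination.

```python
def pick(utility, credences, acts):
    # Choose the act with highest expected utility under this agent's
    # credences (probability estimates over outcomes).
    def eu(act):
        return sum(p * utility[outcome] for outcome, p in credences[act].items())
    return max(acts, key=eu)

utility = {"good": 1.0, "bad": 0.0}   # one utility function shared by both agents
acts = ["cooperate", "defect"]

# The agents rank outcomes identically but disagree about the world:
optimist  = {"cooperate": {"good": 0.9, "bad": 0.1},
             "defect":    {"good": 0.5, "bad": 0.5}}
pessimist = {"cooperate": {"good": 0.2, "bad": 0.8},
             "defect":    {"good": 0.5, "bad": 0.5}}
```

Here `pick(utility, optimist, acts)` returns `"cooperate"` while `pick(utility, pessimist, acts)` returns `"defect"`, despite the identical utility function.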
Also, is the result closer to preference utilitarianism or contractualism than to deontology? Couldn't you treat others as mere means, as long as their interests are outweighed by others' (whether or not you're aggregating)? So, you would still get the consequentialist judgements in various thought experiments. Never treating others as mere means seems like a rule that's too risk-averse, ambiguity-averse or loss-averse about one very specific kind of risk or cause of harm that's singled out (being treated as a mere means), at possibly significant average opportunity cost.
Why not use a subset of the human brain as the benchmark for general intelligence? E.g. linguistic cortex + prefrontal cortex + hippocampus, or the whole cerebral cortex? There's a lot we don't need for general intelligence.
GPT-4 is supposed to have 500x as many parameters as GPT-3. If you use such a subset of the human brain as the benchmark, would GPT-4 match it in optimization power? Do you think GPT-4 will be an AGI?
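For what it's worth, the arithmetic behind that rumored multiplier is easy to check; the brain-side figures in the comments below are rough order-of-magnitude estimates often quoted in the literature, not settled facts, and the parameter-synapse analogy is crude:

```python
gpt3_params = 175e9                # GPT-3's published parameter count
rumored_multiplier = 500           # the rumored GPT-4 figure above
gpt4_params = gpt3_params * rumored_multiplier   # 8.75e13

# Whole-brain synapse counts are commonly estimated at very roughly
# 1e14 to 1e15, so a cortex-only subset would be some fraction of that;
# on the crude "one parameter per synapse" analogy, 8.75e13 parameters
# would be in the same ballpark as such a subset.
```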