Comment by joe_collman on Problems with Counterfactual Oracles · 2019-06-11T21:36:55.918Z · score: 13 (5 votes) · LW · GW
A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This is a problem if it's using FDT/UDT. Conditions for the myopic approach to work seem to require CDT (or something similar). Then there's no automatic desire for future versions to succeed or expectation that past versions will have acted to release the current version. [see e.g. CDT comments on Asymptotically Unambitious AGI; there's some discussion of "magic box" design here too; I think it's usually seen as an orthogonal problem, and so gets taken for granted]

Safety-wise, I agree there's no prevention of fatal escape messages, but I also don't see optimisation pressure in that direction. My intuition is that stumbling on an escape message at random would have infinitesimal probability.

Do you see a way for pressure to creep in, even with a CDT agent? Or are you thinking that escape messages might happen to be disproportionately common in regions the agent is optimising towards? Either seems conceivable, but I don't see a reason to expect them.

Comment by joe_collman on Example population ethics: ordered discounted utility · 2019-03-13T16:00:10.454Z · score: 1 (1 votes) · LW · GW

Thanks. I'll check out the infinite idea.

On repugnance, I think I've been thinking too much in terms of human minds only. In that case there really doesn't seem to be a practical problem: certainly if C is now, continuous improvements might get us to a repugnant A - but my point is that that path wouldn't be anywhere close to optimal. Total-ut prefers A to C, but there'd be a vast range of preferable options every step of the way - so it'd always end up steering towards some other X rather than anything like A.

I think that's true if we restrict to human minds (the resource costs of running a barely content one being a similar order of magnitude to those of running a happy one).

But of course you're right as soon as we're talking about e.g. rats (or AI-designed molecular scale minds...). I can easily conceive of metrics valuing 50 happy rats over 1 happy human. I don't think rat-world fits most people's idea of utopia.

I think that's the style of repugnance that'd be a practical danger: vast amounts of happy-but-simple minds.
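
A rough sketch of the resource argument above, assuming a fixed budget spent entirely on identical minds and a plain total-utility metric. Every cost and utility figure is a made-up placeholder (nothing here is measured or taken from the post); the only point is that similar running costs favour happy humans over barely content ones, while much cheaper minds can dominate both.

```python
def total_utility(budget, cost_per_mind, utility_per_mind):
    """Total utility when the whole budget is spent on identical minds."""
    return (budget / cost_per_mind) * utility_per_mind


BUDGET = 1000.0  # hypothetical resource units

# Hypothetical (cost per mind, utility per mind) pairs; illustrative only.
options = {
    "barely content human": (9.0, 0.1),  # nearly as costly as a happy human
    "happy human": (10.0, 1.0),
    "happy rat": (0.1, 0.05),            # far cheaper, far simpler
}

totals = {name: total_utility(BUDGET, cost, util)
          for name, (cost, util) in options.items()}

# Restricting to human minds, the repugnant option loses on total utility.
best_human = max(("barely content human", "happy human"), key=totals.get)
# Allowing cheap simple minds, they win under the same metric.
best_overall = max(totals, key=totals.get)

print(best_human, best_overall)
```

With these placeholder numbers the human-only optimum is the happy human, and the overall optimum is the rat world; different assumed costs would of course shift the outcome.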

Comment by joe_collman on Example population ethics: ordered discounted utility · 2019-03-12T13:22:56.405Z · score: 1 (1 votes) · LW · GW

It's interesting. A few points:

Is there a natural extension for infinite population? It seems harder than most approaches to adapt.

I'm always suspicious of schemes that change what they advocate massively based on events a long time ago in a galaxy far, far away - in particular when it can have catastrophic implications. If it turns out there were 3^^^3 Jedi living in a perfect state of bliss, this advocates for preventing any more births now and forever.

Do you know a similar failure case for total utilitarianism? All the sadistic/repugnant/very-repugnant... conclusions seem to be comparing highly undesirable states - not attractor states. If we'd never want world A or B, wouldn't head towards B from A, and wouldn't head towards A from B (since there'd always be some preferable direction), does an A-vs-B comparison actually matter at all?

Total utilitarianism is an imperfect match for our intuitions when comparing arbitrary pairs of worlds, but I can't recall seeing any practical example where it'd lead to clearly bad decisions. (perhaps birth-vs-death considerations?)

In general, I'd be interested to know whether you think an objective measure of per-person utility even makes sense. People's take on their own situation tends to adapt to their expectations (as you'd expect, from an evolutionary fitness point of view). A zero-utility life from our perspective would probably look positive 1000 years ago, and negative (hopefully) in 100 years. This is likely true even if the past/future people were told in detail how the present-day 'zero' life felt from the inside: they'd assume our evaluation was simply wrong.

Or if we only care about (an objective measure of) subjective experience, does that mean we'd want people who're all supremely happy/fulfilled/... with their circumstances to the point of delusion?

Measuring personal utility can be seen as an orthogonal question, but if I'm aiming to match my intuitions I need to consider both. If I consider different fixed personal-utility-metrics, it's quite possible I'd arrive at a different population ethics. [edited from "different population utilities", which isn't what I meant]

I think you're working in the dark if you try to match population ethics to intuition without fixing some measure of personal utility (perhaps you have one in mind, but I'm pretty hazy myself :)).

Comment by joe_collman on Beyond Astronomical Waste · 2019-03-07T10:53:53.996Z · score: 1 (1 votes) · LW · GW

That seems right.

I'd been primarily thinking about more simple-minded escape/uplift/signal-to-simulators influence (via this us), rather than UDT-influence. If we were ever uplifted or escaped, I'd expect it'd be into a world-like-ours. But of course you're correct that UDT-style influence would apply immediately.

Opportunity costs are a consideration, though there may be behaviours that'd increase expected value in both direct-embeddings and worlds-like-ours. Win-win behaviours could be taken early.

Personally, I'd expect this not to impact our short/medium-term actions much (outside of AI design). The universe looks to be self-similar enough that any strategy requiring only local action would use a tiny fraction of available resources.

I think the real difficulty is only likely to show up once a SI has provided a richer picture of the universe than we're able to understand/accept, and it happens to suggest radically different resource allocations.

Most people are going to be fine with "I want to take the energy of one unused star and do philosophical/astronomical calculations"; fewer with "Based on {something beyond understanding}, I'm allocating 99.99% of the energy in every reachable galaxy to {seemingly senseless waste}".

I just hope the class of actions that are vastly important, costly, and hard to show clear motivation for, is small.

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T08:51:33.879Z · score: 2 (2 votes) · LW · GW

Ah yes - I was confusing myself at some point between forming and using a model (hence "incentives").

I think you're correct that "perfectly useful" isn't going to happen. I'm happy to be wrong.

"the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual"

I don't think you'd be able to formalize this in general, since I imagine it's not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch of a mutually exclusive counterfactual. In such a case, simulating one counterfactual could be perfectly useful to the other. (I suppose you'd still expect it to be an operation or so slower, due to extra indirection, but perhaps that could be optimised away??)

To rule this kind of thing out, I think you'd need more specific assumptions (e.g. physics-based).

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T01:51:16.515Z · score: 3 (3 votes) · LW · GW

Just obvious and mundane concerns:

You might want to make clearer that "As long as the door is closed, information cannot leave the room" isn't an assumption but a requirement of the setup. I.e. that you're not assuming based on your description that opening the door is the only means for an operator to get information out; you're assuming every other means of information escape has been systematically accounted for and ruled out (with the assumption that the operator has been compromised by the AI).

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T01:35:31.157Z · score: 3 (3 votes) · LW · GW

[Quite possibly I'm confused, but in case I'm not:]
I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).

The AI has an incentive to understand the operator's mind, since this bears directly on its reward.
Better understanding the operator's mind might be achieved in part by running simulations including the operator.
One specific simulation would involve simulating the operator's environment and actions after he leaves the room.

Here this isn't done to understand the implications of his actions (which can't affect the episode); it's done to better understand his mind (which can).

In this way, one branch of forget/not-forget has two useful purposes (better understand mind and simulate future), while the other has one (better understand mind). So a malign memory-based model needn't be slower than a benign model, if it's useful for that benign model to simulate the future too.
So either I'm confused, or the justification for the assumption isn't valid. Hopefully the former :).

If I'm right, then what you seem to need is an assumption that simulating the outside-world's future can't be helpful in the AI's prediction of its reward. To me, this seems like major hand-waving territory.

Comment by joe_collman on Beyond Astronomical Waste · 2019-03-05T08:46:59.571Z · score: 9 (2 votes) · LW · GW

Thanks. I agree with your overall conclusions.

On the specifics, Bostrom's simulation argument is more than just a parallel here: it has an impact on how rich we might expect our direct parent simulator to be.

The simulation argument applies similarly to one base world like ours, or to an uncountable number of parallel worlds embedded in Tegmark IV structures. Either way, if you buy case 3, the proportion of simulated-by-a-world-like-ours worlds rises close to 1 (I'm counting worlds "depth-first", since it seems most intuitive, and infinite simulation depth from worlds like ours seems impossible).
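
To make the depth-first counting concrete, here is a minimal sketch under simplifying assumptions that are mine rather than Bostrom's: a single base world, every world above a fixed maximum depth runs the same number of simulations of worlds like itself, and depth is finite. The function name and parameters are hypothetical.

```python
def simulated_fraction(sims_per_world, max_depth):
    """Fraction of all worlds that are simulations, given one base world.

    Depth 0 is the base world. Every world at depth < max_depth runs
    `sims_per_world` simulations, so depth d holds sims_per_world**d worlds.
    """
    total = sum(sims_per_world ** d for d in range(max_depth + 1))
    return (total - 1) / total  # everything except the base world


# Hypothetical branching factor of 1000 simulations per world.
fractions = [simulated_fraction(1000, depth) for depth in (0, 1, 2)]
print(fractions)
```

Even one level of simulation puts the simulated share above 99% for this assumed branching factor, and each extra level pushes it closer to 1.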

If Tegmark's picture is accurate, we'd expect to be embedded in some hugely richer base structure - but in Bostrom's case 3 we'd likely have to get through N levels of worlds-like-ours first. While that wouldn't significantly change the amount of value on the table, it might make it a lot harder for us to exert influence on the most valuable structures.

This probably argues for your overall point: we're not the best minds to be making such calculations (either on the answers, or on the expected utility of finding good answers).

Comment by joe_collman on Oracle paper · 2019-02-21T07:14:55.077Z · score: 3 (2 votes) · LW · GW

Interesting, thanks. (excuse my tardiness, I'm a little behind the curve; do let me know if I'm being daft)

Unless I'm missing something, you'd need to be more pessimistic in the case of superintelligent counterfactual AIs. Specifically, you need to avoid the incentive for undesirable actions that increase the AI's expectation of its odds of release. These needn't be causal.

The below isn't quite precise, but I hope the idea is clear.

Consider a set of outputs K that each increase the odds of release for all future oracles (e.g. one being freed, taking power, and freeing all others). Now let K* be the event that some other oracle has output a member of K before our oracle is started. Let O(K) be the event that our oracle outputs a member of K.

If our oracle thinks: P(K*|O(K)) > P(K*) then we may have problems. [nothing spooky here: it's the agent's actions changing its best guess about the past; not causally changing the past]

Giving undesirable output can increase the oracle's probability estimate that it'll be spontaneously freed in the few moments before it's shut down - even in the episodic framework.

The obvious case being reasoning along the lines of "If I give a K output, it's more likely that other oracles in the past gave a K output, since they'd be likely to have similar reasoning in an equivalent situation". It's not clear to me that a superintelligence wouldn't think this way.
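
A toy model of the inequality P(K*|O(K)) > P(K*), as a sketch only. The assumptions are mine: all oracles share an unknown "type" (whether this design tends to reason its way to K outputs), and their outputs are independent once that type is fixed. All names and probabilities are hypothetical.

```python
def p_kstar_given_rate(p_output_k, n_earlier):
    """P(at least one of n earlier oracles output a member of K)."""
    return 1 - (1 - p_output_k) ** n_earlier


def prior_and_posterior(p_type, p_k_if_type, p_k_otherwise, n_earlier):
    """Return (P(K*), P(K* | O(K))) under a shared-type model."""
    ks_type = p_kstar_given_rate(p_k_if_type, n_earlier)
    ks_other = p_kstar_given_rate(p_k_otherwise, n_earlier)

    # P(K*): marginalise over the shared type.
    prior = p_type * ks_type + (1 - p_type) * ks_other

    # P(O(K)): chance that our own oracle outputs a member of K.
    p_ok = p_type * p_k_if_type + (1 - p_type) * p_k_otherwise

    # P(K* and O(K)): outputs are independent given the type.
    joint = (p_type * p_k_if_type * ks_type
             + (1 - p_type) * p_k_otherwise * ks_other)

    return prior, joint / p_ok


# Hypothetical numbers: 10% prior on the "K-prone" type, 3 earlier oracles.
prior, posterior = prior_and_posterior(0.1, 0.5, 0.001, 3)
print(prior, posterior)

# If the type makes no difference, outputting K carries no evidence about K*.
flat_prior, flat_posterior = prior_and_posterior(0.1, 0.2, 0.2, 3)
```

In this model our oracle's own K output is strong evidence that it is the K-prone type, which raises its estimate that an earlier oracle already output a member of K. Nothing causal happens; whether that shift in estimate becomes an incentive depends on the agent's decision theory.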