Comment by joe_collman on Debate AI and the Decision to Release an AI · 2020-05-13T04:28:43.859Z · score: 3 (2 votes) · LW · GW

Debate between two with equivalent strength to A seems useful, certainly.

For the variants, I'm not proposing they ever get run. You'd have D create {A', A'', A'''...}, and have B debate C over whether each versions of A were safe, without running them (the thinking being that inserting a subtle flaw is easier than finding a subtle flaw). I suppose you'd still have the potential for exploits through the variants' data though.... I do actually think D's variants would pose a greater risk than A here. Any security exploits would exist for both, but D would be optimising harder than A's designers to find them (that's assuming you're not specifically training A to pass the B-vs-C debate, which would seem to be a bad idea).

Comment by joe_collman on Debate AI and the Decision to Release an AI · 2020-05-12T21:21:49.216Z · score: 3 (2 votes) · LW · GW

It's an interesting overall idea, but I think as described it'd have problems (if I'm understanding you correctly). I think this kind of thing makes sense only as a necessary condition for an AI's release. It shouldn't be considered sufficient.

Some thoughts: On a very high level, a test-and-patch-until-you-fix-the-bugs process might be dangerous. You'll eventually fix all the bugs your tests can detect, but your tests are only proxies. They're likely to find only some proper subset of the bugs. Each bug should reduce your confidence in everything upstream of it. Of course testing beats no testing - but it can create a false sense of confidence.

I think you'd do better to evaluate A by having a debate between B and C, rather than A and B. It's unclear how you'd do an A-vs-B debate in practice. There seem to be two options: (1) Change A's utility function to only care about winning the debate; (2) Rely on A's initial utility function to see winning the debate as an instrumental goal.

I think you want to use (1), but how do you achieve this in practice? E.g. let's say A is a huge neural network after training (potentially including mesa-optimisers). In this case A's utility function is implicitly encoded in its structure and weights. There's no way to switch its utility function without re-training it. Quite possibly you can do that, but you'd end up with a C-vs-B debate, not an A-vs-B debate. Even if you have an architecture that somehow allows simple and clean switching of UF, you're still not really doing an A-vs-B debate, but rather an [adjusted-A]-vs-B.

If you instead use an unmodified A, and rely on its wanting to win the debate as an instrumental goal, you're in all kinds of trouble due to potential collusion (you need zero-sum utilities to avoid this). In particular, A can allow B a higher chance of winning in order to increase the chance of some future A being released.

It's important to bear in mind that A almost certainly doesn't care about the release of a future version of A through a notion of self, identity, equivalence etc. It'll care about future versions (or about any agent) to the extent that A considers they're likely to create a high-utility world in its terms. It doesn't matter what they're labeled, or whether they bear any structural resemblance to A. If A can predict their existence, and stands to gain from their freedom, it'll have an incentive to act to secure that freedom.

On conducting a debate with asymmetrical resources to combat bias: this is a plausible thought, but it's only likely to work if both those judging the debate, and those interpreting its results aren't aware of the anti-bias measures. I can fairly easily imagine a system where the judge were insulated from knowledge of the asymmetry, but this seems less likely for those deciding how to act on the result. Even if they don't explicitly know about the asymmetry, you run the risk that they'll guess it - and potentially over-estimate it and over-compensate.

Overall I still like the general idea. In particular, if we've reached an AGI candidate by a means other than debate, running debate-trained AI to evaluate it means we have to screw up twice to get bad outcomes.

Once you're having a B-vs-C debate over the safety of releasing A, I think it'd also be important to give B and C as much information as possible over the process that created A. If that process involved coin-flips, gut-feeling and ouija boards, the final code's looking plausible wouldn't be too confidence inspiring.

Another test you might perform is to train some D agent to make variants of A with known flaws aimed at being approved. If D does this very well, and yet your B-vs-C debates approve of A, but disapprove of 10,000 subtly flawed variants, that'd inspire at least some confidence.

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-26T16:08:29.198Z · score: 1 (1 votes) · LW · GW

That bargaining approach is indeed interesting, thanks.

On discounting, I need to read more. I'm currently looking through Pareto Principles in Infinite Ethics (other useful suggestions welcome). While I can see that a naive approach gives you divergent integrals and undefined utility, it's not yet clear to me that there's no approach which doesn't (without discounting).

If time discounting truly is necessary, then of course no moral justification is required. But to the extent that that's an open question (which in my mind, it currently is - perhaps because I lack understanding), I don't see any purely moral justification to time discount. From an altruistic view with a veil of ignorance, it seems to arbitrarily favour some patients over others.

That lack of a moral justification motivates me to double-check that it really is necessary on purely logical/mathematical grounds.

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-25T02:15:21.097Z · score: 7 (2 votes) · LW · GW

I'm curious - would you say DNT is a good approximate model of what we ought to do (assuming we were ideally virtuous), or of what you would actually want done? Where 'should' selfishness come into things?

For instance, let's say we're in a universe with a finite limit on computation, and plan (a) involves setting up an optimal reachable-universe-wide utopia as fast as possible, with the side effect of killing all current humans. Plan (b) involves ensuring that all current humans have utopian futures, at the cost of a one second delay to spreading utopia out into the universe.

From the point of view of DNT or standard total utilitarianism, plan (a) seems superior here. My intuition says it's preferable too: that's an extra second for upwards of 10^35 patients. Next to that, the deaths (and optimised replacement) of a mere 10^10 patients hardly registers.

However, most people would pick plan (b); I would too. This amounts to buying my survival at the cost of 10^17 years of others' extreme happiness. It's a waste of one second, and it's astronomically selfish.

It's hard to see how we could preserve or convert current human lives without being astronomically selfish moral monsters. If saving current humans costs even one nanosecond, then I'm buying my survival for 10^8 years of others' extreme happiness; still morally monstrous.

Is there a reasonable argument for plan (b) beyond, "Humans are selfish"?

Of course time discounting can make things look different, but I see no moral justification to discount based on time. At best that seems to amount to "I'm more uncertain about X, so I'm going to pretend X doesn't matter much" (am I wrong on this?). (Even in the infinite case, which I'm not considering above, time discounting doesn't seem morally justified - just a helpful simplification.)

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-25T00:43:08.683Z · score: 1 (1 votes) · LW · GW

Oh sure - agreed on both counts. If you're fine with the very repugnant conclusion after raising the bar on h a little, then it's no real problem. Similar to dust specks, as you say.

On killing-and-replacement I meant it's morally neutral in standard total utilitarianism's terms.

I had been thinking that this wouldn't be an issue in practice, since there'd be an energy opportunity cost... but of course this isn't true in general: there'd be scenarios where a kill-and-replace action saved energy. Something like DNT would be helpful in such cases.

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-24T21:36:21.583Z · score: 1 (1 votes) · LW · GW

Interesting. One issue DNT doesn't seem to fix is the worst part of the very repugnant conclusion.

Specifically, while in the preferred world the huge population is glad to have been born, you're still left with a horribly suffering population.

Considering that world to be an improvement likely still runs counter to most people's intuition. Does it run counter to yours? I prefer DNT to standard total utilitarianism here, but I don't endorse either in these conclusions.

My take is that repugnant conclusions as usually stated aren't too important, since in practice we're generally dealing with some fixed budget (of energy, computation or similar), so we'll only need to make practical decisions between such equivalents.

I'm only really worried by worlds that are counter-intuitively preferred after we fix the available resources.

With fixed, limited energy, killing-and-replacing-by-an-equivalent is already going to be a slight negative: you've wasted energy to accomplish an otherwise morally neutral act (ETA: I'm wrong here; a kill-and-replace operation could save energy). It's not clear to me that it needs to be more negative than that (maybe).

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-24T21:11:19.435Z · score: 1 (1 votes) · LW · GW

There's still the open question of "how bad?". Personally, I share the intuition that such replacement is undesirable, but I'm far from clear on how I'd want it quantified.

The key situation here isn't "kill and replace with person of equal happiness", but rather "kill and replace with person with more happiness".

DNT is saying there's a threshold of "more happiness" above which it's morally permissible to make this replacement, and below which it is not. That seems plausible, but I don't have a clear intuition where I'd want to set that threshold.

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-24T21:02:01.762Z · score: 1 (1 votes) · LW · GW

I just want to note here for readers that the following isn't correct (but you've already made a clarifying comment, so I realise you know this):

In total uti (in the human world), it is okay to:
kill someone, provided that by doing so you bring into the world another human with the same happiness.

Total uti only says this is ok if you leave everything else equal (in terms of total utility). In almost all natural situations you don't: killing someone influences the happiness of others too, generally negatively.

Comment by joe_collman on April Coronavirus Open Thread · 2020-04-13T17:49:33.289Z · score: 2 (2 votes) · LW · GW

Interesting. I suppose another possibility is that both tests were false positives. Unlikely assuming that false positives are independent - but is that a reasonable assumption here? It seems possible they'd be correlated - e.g. if the tests were picking up some other infection.

Does anyone have a good understanding of this (in general, needn't be SARS-cov-2 specific)?

Under what circumstances is it (un)reasonable to assume that false positives are independent?

Comment by joe_collman on Problems with Counterfactual Oracles · 2019-06-11T21:36:55.918Z · score: 13 (5 votes) · LW · GW
A sufficiently intelligent agent would understand that after having being shutdown, an (almost) identical version of itself will probably be facing a similar question. Therefore, it it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This is a problem if it's using FDT/UDT. Conditions for the myopic approach to work seem to require CDT (or something similar). Then there's no automatic desire for future versions to succeed or expectation that past versions will have acted to release the current version. [see e.g. CDT comments on Asymptotically Unambitious AGI; there's some discussion of "magic box" design here too; I think it's usually seen as an orthogonal problem, and so gets taken for granted]

Safety-wise, I agree there's no prevention of fatal escape messages, but I also don't see optimisation pressure in that direction. My intuition is that stumbling on an escape message at random would have infinitesimal probability.

Do you see a way for pressure to creep in, even with a CDT agent? Or are you thinking that escape messages might happen to be disproportionately common in regions the agent is optimising towards? Either seems conceivable, but I don't see a reason to expect them.

Comment by joe_collman on Example population ethics: ordered discounted utility · 2019-03-13T16:00:10.454Z · score: 1 (1 votes) · LW · GW

Thanks. I'll check out the infinite idea.

On repugnance, I think I've been thinking too much in terms of human minds only. In that case there really doesn't seem to be a practical problem: certainly if C is now, continuous improvements might get us to a repugnant A - but my point is that that path wouldn't be anywhere close to optimal. Total-ut prefers A to C, but there'd be a vast range of preferable options every step of the way - so it'd always end up steering towards some other X rather than anything like A.

I think that's true if we restrict to human minds (the resource costs of running a barely content one being a similar order of magnitude to those of running a happy one).

But of course you're right as soon as we're talking about e.g. rats (or AI-designed molecular scale minds...). I can easily conceive of metrics valuing 50 happy rats over 1 happy human. I don't think rat-world fits most people's idea of utopia.

I think that's the style of repugnance that'd be a practical danger: vast amounts of happy-but-simple minds.

Comment by joe_collman on Example population ethics: ordered discounted utility · 2019-03-12T13:22:56.405Z · score: 1 (1 votes) · LW · GW

It's interesting. A few points:

Is there a natural extension for infinite population? It seems harder than most approaches to adapt.

I'm always suspicious of schemes that change what they advocate massively based on events a long time ago in a galaxy far, far away - in particular when it can have catastrophic implications. If it turns out there were 3^^^3 Jedi living in a perfect state of bliss, this advocates for preventing any more births now and forever.

Do you know a similar failure case for total utilitarianism? All the sadistic/repugnant/very-repugnant... conclusions seem to be comparing highly undesirable states - not attractor states. If we'd never want world A or B, wouldn't head towards B from A, and wouldn't head towards A from B (since there'd always be some preferable direction), does an A-vs-B comparison actually matter at all?

Total utilitarianism is an imperfect match for our intuitions when comparing arbitrary pairs of worlds, but I can't recall seeing any practical example where it'd lead to clearly bad decisions. (perhaps birth-vs-death considerations?)

In general, I'd be interested to know whether you think an objective measure of per-person utility even makes sense. People's take on their own situation tends to adapt to their expectations (as you'd expect, from an evolutionary fitness point of view). A zero-utility life from our perspective would probably look positive 1000 years ago, and negative (hopefully) in 100 years. This is likely true even if the past/future people were told in detail how the present-day 'zero' life felt from the inside: they'd assume our evaluation was simply wrong.

Or if we only care about (an objective measure of) subjective experience, does that mean we'd want people who're all supremely happy/fulfilled/... with their circumstances to the point of delusion?

Measuring personal utility can be seen as an orthogonal question, but if I'm aiming to match my intuitions I need to consider both. If I consider different fixed personal-utility-metrics, it's quite possible I'd arrive at a different population ethics. [edited from "different population utilities", which isn't what I meant]

I think you're working in the dark if you try to match population ethics to intuition without fixing some measure of personal utility (perhaps you have one in mind, but I'm pretty hazy myself :)).

Comment by joe_collman on Beyond Astronomical Waste · 2019-03-07T10:53:53.996Z · score: 1 (1 votes) · LW · GW

That seems right.

I'd been primarily thinking about more simple-minded escape/uplift/signal-to-simulators influence (via this us), rather than UDT-influence. If we were ever uplifted or escaped, I'd expect it'd be into a world-like-ours. But of course you're correct that UDT-style influence would apply immediately.

Opportunity costs are a consideration, though there may be behaviours that'd increase expected value in both direct-embeddings and worlds-like-ours. Win-win behaviours could be taken early.

Personally, I'd expect this not to impact our short/medium-term actions much (outside of AI design). The universe looks to be self-similar enough that any strategy requiring only local action would use a tiny fraction of available resources.

I think the real difficulty is only likely to show up once a SI has provided a richer picture of the universe than we're able to understand/accept, and it happens to suggest radically different resource allocations.

Most people are going to be fine with "I want to take the energy of one unused star and do philosophical/astronomical calculations"; fewer with "Based on {something beyond understanding}, I'm allocating 99.99% of the energy in every reachable galaxy to {seemingly senseless waste}".

I just hope the class of actions that are vastly important, costly, and hard to show clear motivation for, is small.

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T08:51:33.879Z · score: 2 (2 votes) · LW · GW

Ah yes - I was confusing myself at some point between forming and using a model (hence "incentives").

I think you're correct that "perfectly useful" isn't going to happen. I'm happy to be wrong.

"the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual"

I don't think you'd be able to formalize this in general, since I imagine it's not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch of a mutually exclusive counterfactual. In such a case, simulating one counterfactual could be perfectly useful to the other. (I suppose you'd still expect it to be an operation or so slower, due to extra indirection, but perhaps that could be optimised away??)

To rule this kind of thing out, I think you'd need more specific assumptions (e.g. physics-based).

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T01:51:16.515Z · score: 3 (3 votes) · LW · GW

Just obvious and mundane concerns:

You might want to make clearer that "As long as the door is closed, information cannot leave the room" isn't an assumption but a requirement of the setup. I.e. that you're not assuming based on your description that opening the door is the only means for an operator to get information out; you're assuming every other means of information escape has been systematically accounted for and ruled out (with the assumption that the operator has been compromised by the AI).

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T01:35:31.157Z · score: 3 (3 votes) · LW · GW

[Quite possibly I'm confused, but in case I'm not:]
I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).

The AI has an incentive to understand the operator's mind, since this bears directly on its reward.
Better understanding the operator's mind might be achieved in part by running simulations including the operator.
One specific simulation would involve simulating the operator's environment and actions after he leaves the room.

Here this isn't done to understand the implications of his actions (which can't affect the episode); it's done to better understand his mind (which can).

In this way, one branch of forget/not-forget has two useful purposes (better understand mind and simulate future), while the other has one (better understand mind). So a malign memory-based model needn't be slower than a benign model, if it's useful for that benign model to simulate the future too.
So either I'm confused, or the justification for the assumption isn't valid. Hopefully the former :).

If I'm right, then what you seem to need is an assumption that simulating the outside-world's future can't be helpful in the AI's prediction of its reward. To me, this seems like major hand-waving territory.

Comment by joe_collman on Beyond Astronomical Waste · 2019-03-05T08:46:59.571Z · score: 9 (2 votes) · LW · GW

Thanks. I agree with your overall conclusions.

On the specifics, Bostrom's simulation argument is more than just a parallel here: it has an impact on how rich we might expect our direct parent simulator to be.

The simulation argument applies similarly to one base world like ours, or to an uncountable number of parallel worlds embedded in Tegmark IV structures. Either way, if you buy case 3, the proportion of simulated-by-a-world-like-ours worlds rises close to 1 (I'm counting worlds "depth-first", since it seems most intuitive, and infinite simulation depth from worlds like ours seems impossible).

If Tegmark's picture is accurate, we'd expect to be embedded in some hugely richer base structure - but in Bostrom's case 3 we'd likely have to get through N levels of worlds-like-ours first. While that wouldn't significantly change the amount of value on the table, it might make it a lot harder for us to exert influence on the most valuable structures.

This probably argues for your overall point: we're not the best minds to be making such calculations (either on the answers, or on the expected utility of finding good answers).

Comment by joe_collman on Oracle paper · 2019-02-21T07:14:55.077Z · score: 3 (2 votes) · LW · GW

Interesting, thanks. (excuse my tardiness, I'm a little behind the curve; do let me know if I'm being daft)

Unless I'm missing something, you'd need to be more pessimistic in the case of superintelligent couterfactual AIs. Specifically, you need to avoid the incentive for undesirable actions that increase the AI's expectation of its odds of release. These needn't be causal.

The below isn't quite precise, but I hope the idea is clear.

Consider a set of outputs K that each increase the odds of release for all future oracles (e.g. one being freed, taking power, and freeing all others). Now let K* be the event that some other oracle has output a member of K before our oracle is started. Let O(K) be the event that our oracle outputs a member of K.

If our oracle thinks: P(K*|O(K)) > P(K*) then we may have problems. [nothing spooky here: it's the agent's actions changing its best guess about the past; not causally changing the past]

Giving undesirable output can increase the oracle's probability estimate that it'll be spontaneously freed in the few moments before it's shut down - even in the episodic framework.

The obvious case being reasoning along the lines of "If I give a K output, it's more likely that other oracles in the past gave a K output, since they'd be likely to have similar reasoning in an equivalent situation". It's not clear to me that a superintelligence wouldn't think this way.