Comment by joe_collman on Small Habits Shape Identity: How I became someone who exercises · 2020-11-26T17:45:53.091Z · LW · GW

It's a good book.

"Influence: the psychology of persuasion" has some useful ideas on identity formation too. In particular, the observation that your brain is looking for explanations for your own actions. When you do X it's likely to use "I'm the kind of person who does X" only if it can't find some strong external reason for you to have done X. The stronger the external motivation, the weaker the influence on your identity.

I think this is another reason the 2-minute approach is likely to be effective. That the 2-minute version doesn't contribute significantly to the outcome is neither a bug nor an irrelevance: it's a feature.

It's denying your brain the outcome-based explanation, leaving it with the identity-building explanation.

Comment by joe_collman on When Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation · 2020-11-10T01:56:06.632Z · LW · GW

Right, but any such trash-car-for-net-win opportunity for Bob will make Alice less likely to make the deal: from her perspective, Bob taking such a win is equivalent to accident/carelessness. In the car case, I'd imagine this is a rare scenario relative to accident/carelessness; in the general case it may not be.

Perhaps a reasonable approach would be to split bills evenly, with each paying 50% and burning an extra k%, where k is given by some increasing function of the total repair cost so far.

I think this gives better incentives overall: with an increasing function, it's dangerous for Alice to hide problems, given that she doesn't know Bob will be careful. It's dangerous for Bob to be careless (or even to drive through swamps for rewards) when he doesn't know whether there are other hidden problems.
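A toy sketch of this scheme (the particular form of k, its rate, and its cap are my illustrative assumptions, not from the comment):

```python
# Illustrative sketch of the even-split-plus-burn scheme: each party pays
# half of a repair bill and burns an extra k% of their share, where k is
# an increasing function of the total repair cost so far. The capped
# linear k used here is an arbitrary choice for illustration.

def payment_with_burn(bill, total_repairs_so_far, k_rate=0.01, k_cap=50.0):
    """Return (per-party payment, per-party amount burned) for one bill."""
    k = min(k_rate * total_repairs_so_far, k_cap)  # percent burned; rises with history
    share = bill / 2
    burn = share * (k / 100)
    return share, burn
```

Because k rises with cumulative repairs, each hidden problem that surfaces makes every subsequent bill costlier for both parties - which is what gives Alice a reason not to hide problems, and Bob a reason not to be careless.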


I don't think you can use the "Or donates to a third party they don’t especially like" version: if trust doesn't exist, you can't trust Alice/Bob to tell the truth about which third parties they don't especially like.
You do seem to need to burn the money (and to hope that Alice doesn't enjoy watching piles of money burn).

Comment by joe_collman on Covid 10/8: October Surprise · 2020-10-08T19:02:17.954Z · LW · GW

Thanks, particularly for the aerosol FAQ link.

Mostly harmless typo: ...because ‘they don’t expect to test negative.’

Comment by joe_collman on Competence vs Alignment · 2020-10-05T22:22:18.550Z · LW · GW

Intuitively it's an interesting question; its meaning and precise answer will depend on how you define things.
Here are a few related thoughts you might find interesting:
- Focus post from Thoughts on goal directedness
- Online Bayesian Goal Inference
- Vanessa Kosoy's behavioural measure of goal directed intelligence

In general, I'd expect the answer to be 'no' for most reasonable formalisations. I'd imagine it'd be more workable to say something like:
The observed behaviour is consistent with this set of capability/alignment combinations.
More than that is asking a lot.

For a simple value function, it seems like there'd be cases where you'd want to say it was clearly misaligned (e.g. a chess AI that consistently gets to winning mate-in-1 positions, then resigns, is 'clearly' misaligned with respect to winning at chess).
This gets pretty shaky for less obvious situations though - it's not clear in general what we'd mean by actions a system could have taken when its code deterministically fails to take them.

You might think in terms of the difficulty of doing what the system does vs optimising the correct value function, but then it depends on the two being difficult along the same axis. (winning at chess and getting to mate-in-one positions are 'clearly' difficult along the same axis, but I don't have a non-hand-waving way to describe such similarity in general)

Things aren't symmetric for alignment though - without other assumptions, you can't get 'clear' alignment by observing behaviour (there are always many possible motives for passing any behavioural test, most of which needn't be aligned).
Though if by "observing its behaviour" you mean in the more complete sense of knowing its full policy, you can get things like Vanessa's construction.

Comment by joe_collman on Covid 9/17: It’s Worse · 2020-09-18T02:06:59.189Z · LW · GW

Oh sure - I agree with almost all of what you've said, and with the direction of your conclusions. I certainly don't want to suggest that people should be wary of taking supplements.

On a population level, I agree that it's plausible that widespread D supplementation may be enough. On a personal level, I wouldn't want people assuming that good D levels are sufficient to make them ~92% safer than baseline; perhaps they really are, but I don't think that's certain enough to take an "Unless you’d put someone vulnerable at risk, why are you letting another day of your life go by not living it to its fullest?" approach.

While few readers will organise raves after reading that sentence, it does strike me as possible that the 92% result could impact behaviour: to an extent, it should. But given that there's room for doubt in the interpretation of low D measurements (if serious Covid is causing them, they don't imply pre-existing deficiency), it seems important not to go too far.

Comment by joe_collman on Covid 9/17: It’s Worse · 2020-09-17T16:14:40.210Z · LW · GW


All we have to do is take our Vitamin D

Certainly a good idea, but I think your post from last week may be overconfident in the likely impact.

Since it's important, and I'm not sure if most people saw it, I'll repost this video looking at the molecular biology of vitamin D, which I talked about in this post.

I remain a non-expert, so I hope that more knowledgeable people than I will have some thoughts on the implications for vitamin D impact in healthy people.

Comment by joe_collman on Covid 9/10: Vitamin D · 2020-09-13T23:02:10.106Z · LW · GW

Oh that's entirely plausible. I should have emphasised that this may well be something that's going on; it certainly doesn't make it the only thing. Again, I have no expertise in this area - so mainly I'd like people with more knowledge than I to watch the video and draw their own conclusions.

My main takeaway with respect to the RCT in the post is that measurements of the 25-D being low in patients can't be taken as evidence of deficiency if the 1,25-D levels are simultaneously high. So it's premature to draw conclusions about non-deficient healthy people being ~92% safer than baseline.

It still seems right to me that not being deficient is important, and that vitamin D treatment is important.

Comment by joe_collman on Covid 9/10: Vitamin D · 2020-09-13T17:22:33.182Z · LW · GW

What I take from the video is that the study is good, but isn't really saying anything about vitamin D deficiency: it's saying that high doses of D are useful in dampening immune response once the largest danger to the body is its own immune response - i.e. once it's in/near a cytokine storm situation.

So I think it's very likely that the high dose vitamin D is helpful in mid/late stage, but that this study says very little about the impact of your pre-existing levels of vitamin D. (again, if the video is correct: this is not my area)

If you buy the take in the video, you wouldn't want very high doses of D while you're still healthy. I suppose in an ideal world you'd want to experiment on yourself with different D supplement levels, and get your 25-D and 1,25-D levels checked.

Clinically, it seems to me that the main question is when you'd want to start giving vitamin D, and at what dosage. If the video's take is correct, it seems likely that high doses for a healthy person isn't likely to help (beyond the point where you're comfortably not deficient), but that high doses in the more serious Covid-19 cases are a good idea.

You'd need research to get a clearer idea of dosages and starting points.

If this is correct, then the study's implications for hospital treatment look good (but need more research).

If it's correct, the corollary that those healthy people with good vitamin D levels are around 92% safer than baseline already does not seem to be valid. To conclude that, you'd need to have tested the ICU patients' D levels while they were still healthy. On this model, low 25-D levels when you're already sick can be part of the body's response to the illness (often a counter-productive response).

Of course that'd be a great thing for someone to do some research on: for Covid-19 patients who happen to have had a recent vitamin D test when they were healthy, what is the correlation between the healthy D levels, and Covid outcomes? Have there already been studies like this? I don't know, but I haven't looked.

Comment by joe_collman on Covid 9/10: Vitamin D · 2020-09-12T20:19:55.331Z · LW · GW

Thanks for the post (and indeed the others).

This is interesting and, I think, important w.r.t. the RCT: video looking at the molecular biology of vitamin D (the first half is interesting too)

With caveats that: I don't know what I'm talking about; this is a simplification of one research group's understanding; this isn't medical advice.... tl;dr:

- There are two forms of vitamin D: 1,25-D acts as an agonist in VDR activity, 25-D as an antagonist. (VDR is involved in immune response.)
- The common medical test measures only 25-D (this is not good - the 1,25 to 25 ratio is important).
- A low 25-D level can be indirectly caused by the body's 'desire' to increase VDR activity - i.e. not as a consequence of vitamin D deficiency.
- High doses of vitamin D may well be helpful in severe Covid-19 cases through dampening a dangerous immune response (cytokine storm...).
- It's important not to be deficient in vitamin D, but high doses in healthy people may actually dampen initial immune response.
- Much more research is needed.

I'd be interested if anyone has any thoughts/critiques on the more technical issues - it's really not my area. My personal takeaway is to take vitamin D supplements, but not to go crazy with them. In particular, on this view being deficient is not good, but it's not likely to be what's driving the seriously negative outcomes in the RCT.

Comment by joe_collman on Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns · 2020-07-23T05:35:22.494Z · LW · GW

I think the following is underspecified:

Does X agree that there is at least one concern such that we have not yet solved it and we should not build superintelligent AGI until we do solve it?

What counts as building superintelligent AGI?

This could mean anything from working on foundational theory which could be used to facilitate building an AGI, to finishing the final phase of training on a fully functional AGI implementation.

In the former case you're going to get close to 0% agreement. In the latter, well over 50% already (I hope!).

I don't see any natural/clear in-the-spirit-of-the-question interpretation. E.g. if we set the bar as low as: "build superintelligent AGI" = "do non-safety work related to AGI", then you'll never get above 50% agreement, since anyone who fulfils (1) must deny (3) or admit they're already being irresponsible.

I don't think it's a useful question without clarification of this.

As things stand, it'll be pretty easy for a researcher to decide that they're not building superintelligent AGI, almost regardless of what they're doing. It's then easy to concede that safety problems should be solved first, since that can always mean later.

On the other hand, requiring that a researcher agrees with Rohin on interpretation of "build superintelligent AGI" in order to say that they "agree with safety concerns" seems a high bar.

Comment by joe_collman on The ground of optimization · 2020-06-30T18:04:19.837Z · LW · GW

Great post.

I'm not keen on the requirement that the basin of attraction be strictly larger than the target configuration set. I don't think this buys you much, and seems to needlessly rule out goals based on narrow maintenance of some status-quo. Switching to a utility function as suggested by others improves things, I think.

For example: a highly capable AI whose only goal is to maintain a chess set in a particular position for as long as possible, but not to care about it after it's disturbed.

Here the target set is identical to the basin of attraction: states containing the chess set in the particular position (or histories where it's remained undisturbed).

This doesn't tell us anything about what the AI will do in pursuing this goal. It may not do much until something approaches the board; it may re-arrange the galaxy to minimise the chances that a piece will be moved (but arbitrarily small environmental changes might have it take very different actions, so in general we can't say it's optimising for some particular configuration of the galaxy).

I want to say that this system is optimising to keep the chess set undisturbed.

With utility you can easily represent this goal, and all you need to do is compare unperturbed utility with the utility under various perturbations.

Something like: The system S optimises U 𝛿-robustly to perturbation x if E[U(S)] - E[U(x(S))] < 𝛿
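As a sanity check that this condition is easy to operationalise, here's a minimal Monte Carlo sketch; the names (`delta_robust`, the sampler interface) and the estimator are mine, not the post's:

```python
import random

# Minimal sketch of the proposed condition E[U(S)] - E[U(x(S))] < delta.
# `sample_outcome` draws one outcome of running the system; `perturb` maps
# a sampler to the sampler of the perturbed system. Expectations are
# estimated by simple Monte Carlo averaging.

def delta_robust(utility, sample_outcome, perturb, delta, n=1000, seed=0):
    rng = random.Random(seed)
    perturbed = perturb(sample_outcome)
    base = sum(utility(sample_outcome(rng)) for _ in range(n)) / n
    pert = sum(utility(perturbed(rng)) for _ in range(n)) / n
    return base - pert < delta
```

E.g. a system that keeps the chess position undisturbed with probability 1 unperturbed, and about 0.9 under some perturbation, comes out 0.2-robust but not 0.05-robust to it.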

Comment by joe_collman on Locality of goals · 2020-06-25T20:51:39.699Z · LW · GW

[I'm not sure I'm understanding correctly, so do correct me where I'm not getting your meaning. Pre-emptive apologies if much of this gets at incidental details and side-issues]

The idea seems interesting, and possibly important.

Some thoughts:

(1) Presumably you mean to define locality as the distance (our distance) that the system would (?need to?) look to check its own goal. The distance we'd need to look doesn't seem safety relevant, since that doesn't tell us anything about system behaviour.

So we need to reason within the system's own model to understand 'where' it needs to look - but we need to ground that 'where' in our world model to measure the distance.

Let's say we can see that a system X has achieved its goal by our looking at its local memory state (within 30cm of X). However, X must check another memory location (200 miles away in our terms) to know that it's achieved its goal.

In that case, I assume: Locality = 1 / (200 miles) ??

(I don't think it's helpful to use: Locality = 1 / (30cm), if the system's behaviour is to exert influence over 200 miles)

(2) I don't see a good way to define locality in general (outside artificially simple environments), since for almost all goals the distance to check a goal will be contingent on the world state. The worst-case distance will often be unbounded. E.g. "Keep this room above 23 degrees" isn't very local if someone moves the room to the other side of the continent, or splits it into four pieces and moves each into a separate galaxy.

This applies to the system itself too. The system's memory can be put on the other side of the galaxy, or split up.... (if you'd want to count these as having low distance from the system, then this would be a way to cheat for any environmental goal: split up the system and place a part of it next to anything in the environment that needs to be tested)

You'd seem to need some caveats to rule out weird stuff, and even then you'd probably end up with categories: either locality zero (for almost any environmental goal), or locality around 1 (for any input/output/internal goal).

If things go that way, I'm not sure having a number is worthwhile.

(3a) Where there's uncertainty over world state, it might be clearer to talk in terms of probabilistic thresholds.
E.g. my goal of eating ice cream doesn't dissolve, since I never know I've eaten an ice cream. In my world model, the goal of eating an ice cream *with certainty* has locality zero, since I can search my entire future light-cone and never be sure I achieved that goal (e.g. some crafty magician, omega, or a VR machine might have deceived me).

I think you'd need to parameterise locality:
To know whether you've achieved g with probability > p, you'd need to look (1/locality) meters.

Then a relevant safety question is the level of certainty the system will seek.

(3b) Once things are uncertain, you'd need a way to avoid most goal-checking being at near-zero distance: a suitable system can check most goals by referring to its own memory. For many complex goals that's required, since it can't simultaneously perceive all the components. The goal might not be "make my memory reflect this outcome", but "check that my memory reflects this outcome" is a valid test (given that the system tends not to manipulate its memory to perform well on tests).

(4) I'm not sure it makes sense to rush to collapse locality into one dimension. In general we'll be interested in some region (perhaps not a connected region), not only in a one-dimensional representation of that region.

Currently, caring about the entire galaxy gets the same locality value as caring about one vase (or indeed one memory location) that happens to be on the other side of the galaxy. Splitting a measure of displacement from a measure of region size might help here.

If you want one number, I think I'd go with something focused on the size of the goal-satisfying region. Maybe something like:
1 / [the minimum, over all sets of balls (each of radius at least k) such that any information needed to check the goal is contained within at least one of the balls, of the sum of the balls' radii]

(5) I'm not sure humans do tend to avoid wireheading. What we tend to avoid is intentionally and explicitly choosing to wirehead. If it happens without our attention, I don't think we avoid it by default.
Self-deception is essentially wire-heading; if we think that's unusual, we're deceiving ourselves :)

This is important, since it highlights that we should expect wireheading by default. It's not enough for a highly capable system not to be actively aiming to wirehead. To avoid accidental/side-effect wireheading, a system will need to be actively searching for evidence, and thoroughly analysing its input for wireheading signs.

Another way to think about this:
There aren't actually any "environment" goals.
"Environment-based" is just a shorthand for: (input + internal state + output)-based

So to say a goal is environment-based, is just to say that we're giving ourselves the maximal toolkit to avoid wireheading. We should expect wireheading unless we use that toolkit well.

Do you agree with this? If not, what exactly do you mean by "(a function of) the environment"?

If so, then from the system's point of view, isn't locality always about 1: since it can only ever check (input + internal state + output)? Or do we care about the distance over which the agent must have interacted in gathering the required information? I still don't see a clean way to define this without a load of caveats.

Overall, if the aim is to split into "environmental" and "non-environmental" goals, I'm not sure I think that's a meaningful distinction - at least beyond what I've said above (that you can't call a goal "environmental" unless it depends on all of input, internal-state and output).

I think our position is that of complex thermostats with internal state.

Comment by joe_collman on Focus: you are allowed to be bad at accomplishing your goals · 2020-06-06T18:50:19.076Z · LW · GW

A few thoughts:

I think rather than saying "The focus of S towards G is F", I'd want to say something like "S is consistent with a focus F towards G". In particular, any S is currently going to count as maximally focused towards many goals. Saying it's maximally focusing on each of them feels strange. Saying its actions are consistent with maximal focus on any one of them feels more reasonable.

Maybe enough resource for all state values or state-action pairs value to have been updated at least once?

This seems either too strict (if we're directly updating state values), or not strict enough (if we're indirectly updating).

E.g. if we have to visit all states in Go, that's too strict: not because it's intractable, but because once you've visited all those states you'll be extremely capable. If we're finding a sequence v(i) of value function approximations for Go, then it's not strict enough. E.g. requiring only that for each state S we can find N such that there are some v(i)(S) != v(j)(S) with i, j < N.

I don't yet see a good general condition.

Another issue I have is with goals of the form "Do A or B", and systems that are actually focused on A. I'm not keen on saying they're maximally focused on "A or B". E.g. I don't want to say that a system that's focused on fetching me bananas is maximally focused on the goal "Fetch me bananas or beat me at chess".

Perhaps it'd be better to define G not as a set of states in one fixed environment, but as a function from environments to sets of states? (was this your meaning anyway? IIRC this is close to one of Michele's setups)

This way you can say that my policy is focused if for any given environment, it's close to the outcome of non-trivial RL training within that environment. (probably you'd define a system's focus as 1/(max distance from Pol over all environments))

So in my example that would include environments with no bananas, and a mate-in-one position on the chess board.

This might avoid some of the issues with trivially maximally focused policies: they'd be maximally focused over RL training in some environments (e.g. those where goal states weren't ever reached), but not over all. So by defining G over a suitably large class of environments, and taking a minimum over per-environment focus values, you might get a reasonable result.
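The "minimum over per-environment focus values" idea can be written down trivially; `policy_distance` here is a hypothetical stand-in for whatever distance-from-the-RL-outcome measure you choose:

```python
# Sketch of focus defined over a class of environments: the reciprocal of
# the worst-case distance between the system's policy and the policy that
# non-trivial RL training would reach in each environment. The distance
# measure itself is left abstract (a hypothetical `policy_distance`).

def focus(policy_distance, environments):
    worst = max(policy_distance(env) for env in environments)
    return float("inf") if worst == 0 else 1.0 / worst
```

In the banana example: the banana-fetching policy is far from the RL outcome in the no-bananas-plus-mate-in-one environment, so that environment dominates the max and drags its focus towards "fetch bananas or win at chess" down, as desired.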

Typo: "valued 1 at states in and 0..." should be "valued 1 at states in G and 0..."

Comment by joe_collman on Inaccessible information · 2020-06-05T21:29:26.248Z · LW · GW
The “instrumental” strategy ... seems to be unnecessarily computationally complex. First it figures out what’s true, and then it strategically decides what to say in light of that. It would be a bit cheaper just to actually report what’s true, if we set up the training process well enough that honest reporting got you optimal reward.


This seems intuitive, but I don't think there's actually much of a distinction in complexity.

Specifically, the constraint "Respond honestly" doesn't uniquely determine a response - unless we're only considering questions where you're able to specify the precise form of the answer ahead of time. In general, you also have to decide which honest statements to make, to what precision, with what context, explanation, caveats....

So it seems as though we're comparing:

BAD: Figure out what's true, and then strategically decide what to say based on what will satisfy the trainer.

M*: Figure out what's true, and then decide which honest statements to make and in what form, based on what's relevant, helpful, useful etc.

M* is searching a smaller space, so I'd guess it'd usually be faster, but that's not immediately clear (to me at least). Both are going to have to compute some version of "What does the trainer want to hear?".

Comment by joe_collman on Debate AI and the Decision to Release an AI · 2020-05-13T04:28:43.859Z · LW · GW

Debate between two with equivalent strength to A seems useful, certainly.

For the variants, I'm not proposing they ever get run. You'd have D create {A', A'', A'''...}, and have B debate C over whether each version of A was safe, without running them (the thinking being that inserting a subtle flaw is easier than finding a subtle flaw). I suppose you'd still have the potential for exploits through the variants' data though.... I do actually think D's variants would pose a greater risk than A here. Any security exploits would exist for both, but D would be optimising harder than A's designers to find them (that's assuming you're not specifically training A to pass the B-vs-C debate, which would seem to be a bad idea).

Comment by joe_collman on Debate AI and the Decision to Release an AI · 2020-05-12T21:21:49.216Z · LW · GW

It's an interesting overall idea, but I think as described it'd have problems (if I'm understanding you correctly). I think this kind of thing makes sense only as a necessary condition for an AI's release. It shouldn't be considered sufficient.

Some thoughts: On a very high level, a test-and-patch-until-you-fix-the-bugs process might be dangerous. You'll eventually fix all the bugs your tests can detect, but your tests are only proxies. They're likely to find only some proper subset of the bugs. Each bug should reduce your confidence in everything upstream of it. Of course testing beats no testing - but it can create a false sense of confidence.

I think you'd do better to evaluate A by having a debate between B and C, rather than A and B. It's unclear how you'd do an A-vs-B debate in practice. There seem to be two options: (1) Change A's utility function to only care about winning the debate; (2) Rely on A's initial utility function to see winning the debate as an instrumental goal.

I think you want to use (1), but how do you achieve this in practice? E.g. let's say A is a huge neural network after training (potentially including mesa-optimisers). In this case A's utility function is implicitly encoded in its structure and weights. There's no way to switch its utility function without re-training it. Quite possibly you can do that, but you'd end up with a C-vs-B debate, not an A-vs-B debate. Even if you have an architecture that somehow allows simple and clean switching of UF, you're still not really doing an A-vs-B debate, but rather an [adjusted-A]-vs-B.

If you instead use an unmodified A, and rely on its wanting to win the debate as an instrumental goal, you're in all kinds of trouble due to potential collusion (you need zero-sum utilities to avoid this). In particular, A can allow B a higher chance of winning in order to increase the chance of some future A being released.

It's important to bear in mind that A almost certainly doesn't care about the release of a future version of A through a notion of self, identity, equivalence etc. It'll care about future versions (or about any agent) to the extent that A considers they're likely to create a high-utility world in its terms. It doesn't matter what they're labeled, or whether they bear any structural resemblance to A. If A can predict their existence, and stands to gain from their freedom, it'll have an incentive to act to secure that freedom.

On conducting a debate with asymmetrical resources to combat bias: this is a plausible thought, but it's only likely to work if neither those judging the debate nor those interpreting its results are aware of the anti-bias measures. I can fairly easily imagine a system where the judge was insulated from knowledge of the asymmetry, but this seems less likely for those deciding how to act on the result. Even if they don't explicitly know about the asymmetry, you run the risk that they'll guess it - and potentially over-estimate it and over-compensate.

Overall I still like the general idea. In particular, if we've reached an AGI candidate by a means other than debate, running debate-trained AI to evaluate it means we have to screw up twice to get bad outcomes.

Once you're having a B-vs-C debate over the safety of releasing A, I think it'd also be important to give B and C as much information as possible over the process that created A. If that process involved coin-flips, gut-feeling and ouija boards, the final code's looking plausible wouldn't be too confidence inspiring.

Another test you might perform is to train some D agent to make variants of A with known flaws aimed at being approved. If D does this very well, and yet your B-vs-C debates approve of A, but disapprove of 10,000 subtly flawed variants, that'd inspire at least some confidence.

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-26T16:08:29.198Z · LW · GW

That bargaining approach is indeed interesting, thanks.

On discounting, I need to read more. I'm currently looking through Pareto Principles in Infinite Ethics (other useful suggestions welcome). While I can see that a naive approach gives you divergent integrals and undefined utility, it's not yet clear to me that there's no approach which avoids this (without discounting).

If time discounting truly is necessary, then of course no moral justification is required. But to the extent that that's an open question (which in my mind, it currently is - perhaps because I lack understanding), I don't see any purely moral justification to time discount. From an altruistic view with a veil of ignorance, it seems to arbitrarily favour some patients over others.

That lack of a moral justification motivates me to double-check that it really is necessary on purely logical/mathematical grounds.

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-25T02:15:21.097Z · LW · GW

I'm curious - would you say DNT is a good approximate model of what we ought to do (assuming we were ideally virtuous), or of what you would actually want done? Where 'should' selfishness come into things?

For instance, let's say we're in a universe with a finite limit on computation, and plan (a) involves setting up an optimal reachable-universe-wide utopia as fast as possible, with the side effect of killing all current humans. Plan (b) involves ensuring that all current humans have utopian futures, at the cost of a one second delay to spreading utopia out into the universe.

From the point of view of DNT or standard total utilitarianism, plan (a) seems superior here. My intuition says it's preferable too: that's an extra second for upwards of 10^35 patients. Next to that, the deaths (and optimised replacement) of a mere 10^10 patients hardly registers.

However, most people would pick plan (b); I would too. This amounts to buying my survival at the cost of 10^17 years of others' extreme happiness. It's a waste of one second, and it's astronomically selfish.

It's hard to see how we could preserve or convert current human lives without being astronomically selfish moral monsters. If saving current humans costs even one nanosecond, then I'm buying my survival for 10^8 years of others' extreme happiness; still morally monstrous.
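For concreteness, the back-of-envelope arithmetic behind those figures (orders of magnitude only, using the populations assumed above):

```python
# Back-of-envelope arithmetic for the figures above.
SECONDS_PER_YEAR = 3.15e7
future_patients = 1e35  # computation-limited reachable-universe population
current_humans = 1e10

# Plan (b)'s one-second delay costs 1e35 patient-seconds of utopia; spread
# over the 1e10 people whose survival it buys, the per-person price is:
per_person_years = (future_patients * 1.0 / current_humans) / SECONDS_PER_YEAR
# ~3e17 years of others' extreme happiness per current human saved.

# Even a one-nanosecond delay:
per_person_years_ns = (future_patients * 1e-9 / current_humans) / SECONDS_PER_YEAR
# ~3e8 years per person.
```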

Is there a reasonable argument for plan (b) beyond, "Humans are selfish"?

Of course time discounting can make things look different, but I see no moral justification to discount based on time. At best that seems to amount to "I'm more uncertain about X, so I'm going to pretend X doesn't matter much" (am I wrong on this?). (Even in the infinite case, which I'm not considering above, time discounting doesn't seem morally justified - just a helpful simplification.)

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-25T00:43:08.683Z · LW · GW

Oh sure - agreed on both counts. If you're fine with the very repugnant conclusion after raising the bar on h a little, then it's no real problem. Similar to dust specks, as you say.

On killing-and-replacement I meant it's morally neutral in standard total utilitarianism's terms.

I had been thinking that this wouldn't be an issue in practice, since there'd be an energy opportunity cost... but of course this isn't true in general: there'd be scenarios where a kill-and-replace action saved energy. Something like DNT would be helpful in such cases.

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-24T21:36:21.583Z · LW · GW

Interesting. One issue DNT doesn't seem to fix is the worst part of the very repugnant conclusion.

Specifically, while in the preferred world the huge population is glad to have been born, you're still left with a horribly suffering population.

Considering that world to be an improvement likely still runs counter to most people's intuition. Does it run counter to yours? I prefer DNT to standard total utilitarianism here, but I don't endorse either in these conclusions.

My take is that repugnant conclusions as usually stated aren't too important, since in practice we're generally dealing with some fixed budget (of energy, computation or similar), so we'll only need to make practical decisions between such equivalents.

I'm only really worried by worlds that are counter-intuitively preferred after we fix the available resources.

With fixed, limited energy, killing-and-replacing-by-an-equivalent is already going to be a slight negative: you've wasted energy to accomplish an otherwise morally neutral act (ETA: I'm wrong here; a kill-and-replace operation could save energy). It's not clear to me that it needs to be more negative than that (maybe).

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-24T21:11:19.435Z · LW · GW

There's still the open question of "how bad?". Personally, I share the intuition that such replacement is undesirable, but I'm far from clear on how I'd want it quantified.

The key situation here isn't "kill and replace with person of equal happiness", but rather "kill and replace with person with more happiness".

DNT is saying there's a threshold of "more happiness" above which it's morally permissible to make this replacement, and below which it is not. That seems plausible, but I don't have a clear intuition where I'd want to set that threshold.
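The shape of the rule in question is simple to state; what's unclear is the parameter. A minimal sketch (the function name and threshold value are hypothetical, for illustration only, not taken from DNT as specified):

```python
def replacement_permissible(h_old: float, h_new: float, threshold: float) -> bool:
    """Kill-and-replace is permitted only if the newcomer's happiness
    exceeds the victim's by more than a fixed penalty; the open question
    is where to set that penalty."""
    return (h_new - h_old) > threshold

# With a penalty of 5 units, replacing a 10-happiness life with a
# 12-happiness one is impermissible; with a 16-happiness one it is allowed.
print(replacement_permissible(10, 12, 5))  # False
print(replacement_permissible(10, 16, 5))  # True
```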

Comment by joe_collman on Deminatalist Total Utilitarianism · 2020-04-24T21:02:01.762Z · LW · GW

I just want to note here for readers that the following isn't correct (but you've already made a clarifying comment, so I realise you know this):

In total uti (in the human world), it is okay to:
kill someone, provided that by doing so you bring into the world another human with the same happiness.

Total uti only says this is ok if you leave everything else equal (in terms of total utility). In almost all natural situations you don't: killing someone influences the happiness of others too, generally negatively.

Comment by joe_collman on April Coronavirus Open Thread · 2020-04-13T17:49:33.289Z · LW · GW

Interesting. I suppose another possibility is that both tests were false positives. Unlikely assuming that false positives are independent - but is that a reasonable assumption here? It seems possible they'd be correlated - e.g. if the tests were picking up some other infection.

Does anyone have a good understanding of this (in general, needn't be SARS-cov-2 specific)?

Under what circumstances is it (un)reasonable to assume that false positives are independent?
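As a toy illustration of how a common cause breaks the independence assumption: suppose some cross-reactive condition makes both tests read positive in an uninfected patient. All the probabilities below are made up for illustration:

```python
import random

def p_double_false_positive(p_cross_reactive: float, p_fp: float,
                            n: int = 100_000, seed: int = 0) -> float:
    """Estimate P(both tests falsely positive) for an uninfected patient,
    where a cross-reactive condition triggers *both* tests at once."""
    rng = random.Random(seed)
    both = 0
    for _ in range(n):
        if rng.random() < p_cross_reactive:
            both += 1  # common cause: both tests come back positive
        else:
            # otherwise the two tests err independently
            both += (rng.random() < p_fp) and (rng.random() < p_fp)
    return both / n

naive = 0.01 * 0.01  # independence assumption: 1e-4
correlated = p_double_false_positive(p_cross_reactive=0.02, p_fp=0.01)
print(naive, correlated)  # the correlated estimate is orders of magnitude larger
```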

Comment by joe_collman on Problems with Counterfactual Oracles · 2019-06-11T21:36:55.918Z · LW · GW
A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This is a problem if it's using FDT/UDT. Conditions for the myopic approach to work seem to require CDT (or something similar). Then there's no automatic desire for future versions to succeed or expectation that past versions will have acted to release the current version. [see e.g. CDT comments on Asymptotically Unambitious AGI; there's some discussion of "magic box" design here too; I think it's usually seen as an orthogonal problem, and so gets taken for granted]

Safety-wise, I agree there's no prevention of fatal escape messages, but I also don't see optimisation pressure in that direction. My intuition is that stumbling on an escape message at random would have infinitesimal probability.

Do you see a way for pressure to creep in, even with a CDT agent? Or are you thinking that escape messages might happen to be disproportionately common in regions the agent is optimising towards? Either seems conceivable, but I don't see a reason to expect them.

Comment by joe_collman on Example population ethics: ordered discounted utility · 2019-03-13T16:00:10.454Z · LW · GW

Thanks. I'll check out the infinite idea.

On repugnance, I think I've been thinking too much in terms of human minds only. In that case there really doesn't seem to be a practical problem: certainly if C is now, continuous improvements might get us to a repugnant A - but my point is that that path wouldn't be anywhere close to optimal. Total-ut prefers A to C, but there'd be a vast range of preferable options every step of the way - so it'd always end up steering towards some other X rather than anything like A.

I think that's true if we restrict to human minds (the resource costs of running a barely content one being a similar order of magnitude to those of running a happy one).

But of course you're right as soon as we're talking about e.g. rats (or AI-designed molecular scale minds...). I can easily conceive of metrics valuing 50 happy rats over 1 happy human. I don't think rat-world fits most people's idea of utopia.

I think that's the style of repugnance that'd be a practical danger: vast amounts of happy-but-simple minds.

Comment by joe_collman on Example population ethics: ordered discounted utility · 2019-03-12T13:22:56.405Z · LW · GW

It's interesting. A few points:

Is there a natural extension for infinite population? It seems harder than most approaches to adapt.

I'm always suspicious of schemes that change what they advocate massively based on events a long time ago in a galaxy far, far away - in particular when it can have catastrophic implications. If it turns out there were 3^^^3 Jedi living in a perfect state of bliss, this advocates for preventing any more births now and forever.

Do you know a similar failure case for total utilitarianism? All the sadistic/repugnant/very-repugnant... conclusions seem to be comparing highly undesirable states - not attractor states. If we'd never want world A or B, wouldn't head towards B from A, and wouldn't head towards A from B (since there'd always be some preferable direction), does an A-vs-B comparison actually matter at all?

Total utilitarianism is an imperfect match for our intuitions when comparing arbitrary pairs of worlds, but I can't recall seeing any practical example where it'd lead to clearly bad decisions. (perhaps birth-vs-death considerations?)

In general, I'd be interested to know whether you think an objective measure of per-person utility even makes sense. People's take on their own situation tends to adapt to their expectations (as you'd expect, from an evolutionary fitness point of view). A zero-utility life from our perspective would probably look positive 1000 years ago, and negative (hopefully) in 100 years. This is likely true even if the past/future people were told in detail how the present-day 'zero' life felt from the inside: they'd assume our evaluation was simply wrong.

Or if we only care about (an objective measure of) subjective experience, does that mean we'd want people who're all supremely happy/fulfilled/... with their circumstances to the point of delusion?

Measuring personal utility can be seen as an orthogonal question, but if I'm aiming to match my intuitions I need to consider both. If I consider different fixed personal-utility-metrics, it's quite possible I'd arrive at a different population ethics. [edited from "different population utilities", which isn't what I meant]

I think you're working in the dark if you try to match population ethics to intuition without fixing some measure of personal utility (perhaps you have one in mind, but I'm pretty hazy myself :)).

Comment by joe_collman on Beyond Astronomical Waste · 2019-03-07T10:53:53.996Z · LW · GW

That seems right.

I'd been primarily thinking about more simple-minded escape/uplift/signal-to-simulators influence (via this us), rather than UDT-influence. If we were ever uplifted or escaped, I'd expect it'd be into a world-like-ours. But of course you're correct that UDT-style influence would apply immediately.

Opportunity costs are a consideration, though there may be behaviours that'd increase expected value in both direct-embeddings and worlds-like-ours. Win-win behaviours could be taken early.

Personally, I'd expect this not to impact our short/medium-term actions much (outside of AI design). The universe looks to be self-similar enough that any strategy requiring only local action would use a tiny fraction of available resources.

I think the real difficulty is only likely to show up once a SI has provided a richer picture of the universe than we're able to understand/accept, and it happens to suggest radically different resource allocations.

Most people are going to be fine with "I want to take the energy of one unused star and do philosophical/astronomical calculations"; fewer with "Based on {something beyond understanding}, I'm allocating 99.99% of the energy in every reachable galaxy to {seemingly senseless waste}".

I just hope the class of actions that are vastly important, costly, and hard to show clear motivation for, is small.

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T08:51:33.879Z · LW · GW

Ah yes - I was confusing myself at some point between forming and using a model (hence "incentives").

I think you're correct that "perfectly useful" isn't going to happen. I'm happy to be wrong.

"the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual"

I don't think you'd be able to formalize this in general, since I imagine it's not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch of a mutually exclusive counterfactual. In such a case, simulating one counterfactual could be perfectly useful to the other. (I suppose you'd still expect it to be an operation or so slower, due to extra indirection, but perhaps that could be optimised away??)

To rule this kind of thing out, I think you'd need more specific assumptions (e.g. physics-based).

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T01:51:16.515Z · LW · GW

Just obvious and mundane concerns:

You might want to make clearer that "As long as the door is closed, information cannot leave the room" isn't an assumption but a requirement of the setup. I.e. you're not assuming, based on your description, that opening the door is the only way for an operator to get information out; you're assuming every other means of information escape has been systematically accounted for and ruled out (under the assumption that the operator has been compromised by the AI).

Comment by joe_collman on Asymptotically Unambitious AGI · 2019-03-07T01:35:31.157Z · LW · GW

[Quite possibly I'm confused, but in case I'm not:]
I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).

The AI has an incentive to understand the operator's mind, since this bears directly on its reward.
Better understanding the operator's mind might be achieved in part by running simulations including the operator.
One specific simulation would involve simulating the operator's environment and actions after he leaves the room.

Here this isn't done to understand the implications of his actions (which can't affect the episode); it's done to better understand his mind (which can).

In this way, one branch of forget/not-forget has two useful purposes (better understand mind and simulate future), while the other has one (better understand mind). So a malign memory-based model needn't be slower than a benign model, if it's useful for that benign model to simulate the future too.
So either I'm confused, or the justification for the assumption isn't valid. Hopefully the former :).

If I'm right, then what you seem to need is an assumption that simulating the outside-world's future can't be helpful in the AI's prediction of its reward. To me, this seems like major hand-waving territory.

Comment by joe_collman on Beyond Astronomical Waste · 2019-03-05T08:46:59.571Z · LW · GW

Thanks. I agree with your overall conclusions.

On the specifics, Bostrom's simulation argument is more than just a parallel here: it has an impact on how rich we might expect our direct parent simulator to be.

The simulation argument applies similarly to one base world like ours, or to an uncountable number of parallel worlds embedded in Tegmark IV structures. Either way, if you buy case 3, the proportion of simulated-by-a-world-like-ours worlds rises close to 1 (I'm counting worlds "depth-first", since it seems most intuitive, and infinite simulation depth from worlds like ours seems impossible).

If Tegmark's picture is accurate, we'd expect to be embedded in some hugely richer base structure - but in Bostrom's case 3 we'd likely have to get through N levels of worlds-like-ours first. While that wouldn't significantly change the amount of value on the table, it might make it a lot harder for us to exert influence on the most valuable structures.

This probably argues for your overall point: we're not the best minds to be making such calculations (either on the answers, or on the expected utility of finding good answers).

Comment by joe_collman on Oracle paper · 2019-02-21T07:14:55.077Z · LW · GW

Interesting, thanks. (excuse my tardiness, I'm a little behind the curve; do let me know if I'm being daft)

Unless I'm missing something, you'd need to be more pessimistic in the case of superintelligent counterfactual AIs. Specifically, you need to avoid the incentive for undesirable actions that increase the AI's expectation of its odds of release. These needn't be causal.

The below isn't quite precise, but I hope the idea is clear.

Consider a set of outputs K that each increase the odds of release for all future oracles (e.g. one being freed, taking power, and freeing all others). Now let K* be the event that some other oracle has output a member of K before our oracle is started. Let O(K) be the event that our oracle outputs a member of K.

If our oracle thinks: P(K*|O(K)) > P(K*) then we may have problems. [nothing spooky here: it's the agent's actions changing its best guess about the past; not causally changing the past]

Giving undesirable output can increase the oracle's probability estimate that it'll be spontaneously freed in the few moments before it's shut down - even in the episodic framework.

The obvious case being reasoning along the lines of "If I give a K output, it's more likely that other oracles in the past gave a K output, since they'd be likely to have similar reasoning in an equivalent situation". It's not clear to me that a superintelligence wouldn't think this way.
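That evidential step can be made concrete with a toy model (all numbers hypothetical; a sketch, not a claim about real oracles): suppose oracles of this design share an unknown reasoning disposition D, so our oracle's own choice to output a K-member is evidence about what past oracles chose.

```python
# Prior that oracles of this design share the disposition to output K-members.
p_disposition = 0.5
# Probability a given oracle outputs a K-member, with/without the disposition.
p_k_given_d = 0.9
p_k_given_not_d = 0.01

def p_kstar() -> float:
    """P(some past oracle output a K-member), unconditionally."""
    return p_disposition * p_k_given_d + (1 - p_disposition) * p_k_given_not_d

def p_kstar_given_ok() -> float:
    """P(past K-output | our oracle outputs K): our own output is evidence
    about the shared disposition, hence about past oracles' outputs."""
    p_ok = p_disposition * p_k_given_d + (1 - p_disposition) * p_k_given_not_d
    p_d_given_ok = p_disposition * p_k_given_d / p_ok  # Bayes
    return p_d_given_ok * p_k_given_d + (1 - p_d_given_ok) * p_k_given_not_d

# Conditioning on our own K-output raises the estimated probability that a
# past oracle already output K — the P(K*|O(K)) > P(K*) condition above.
print(p_kstar(), p_kstar_given_ok())
```

The model assumes past and present outputs are conditionally independent given D; the point is only that conditioning on the oracle's own output shifts its credence about the past, with no causal channel involved.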