## Posts

Review of "Learning Normativity: A Research Agenda" 2021-06-06T13:33:28.371Z
Review of "Fun with +12 OOMs of Compute" 2021-03-28T14:55:36.984Z
A Critique of Non-Obstruction 2021-02-03T08:45:42.228Z
Literature Review on Goal-Directedness 2021-01-18T11:15:36.710Z

Comment by Joe_Collman on Dating Minefield vs. Dating Playground · 2021-09-16T20:31:36.166Z · LW · GW

Worth noting that the categories aren't mutually exclusive (or the graph is just wrong). So e.g. there may be many people who met neighbours at church, or met coworkers at bars.

This may also help to explain the online curve accelerating hard, then hitting the restaurant/bar curve like a wall. Either early adopters of online dating were all restaurant/bar-meeting people, or the restaurant/bar people were early to be fine with reporting having met online (or both).

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-09-04T00:13:23.226Z · LW · GW

Thanks, that's interesting. [I did mean to reply sooner, but got distracted]

A few quick points:

Yes, by "incoherent causal model" I only mean something like "causal model that has no clear mapping back to a distribution over real worlds" (e.g. where different parts of the model assume that [kite exists] has different probabilities).
Agreed that the models LCDT would use are coherent in their own terms. My worry is, as you say, along garbage-in-garbage-out lines.

Having LCDT simulate HCH seems more plausible than its taking useful action in the world - but I'm still not clear how we'd avoid the LCDT agent creating agential components (or reasoning based on its prediction that it might create such agential components) [more on this here: point (1) there seems ok for prediction-of-HCH-doing-narrow-task (since all we need is some non-agential solution to exist); point (2) seems like a general problem unless the LCDT agent has further restrictions].

Agreed on HCH practical difficulties - I think Evan and Adam are a bit more optimistic on HCH than I am, but no-one's saying it's a non-problem. From the LCDT side, it seems we're ok so long as it can simulate [something capable and aligned]; HCH seems like a promising candidate.

On HCH-simulation practical specifics, I think a lot depends on how you're generating data / any model of H, and the particular way any [system that limits to HCH] would actually limit to HCH. E.g. in an IDA setup, the human(s) in any training step will know that their subquestions are answered by an approximate model.

I think we may be ok on error-compounding, so long as the learned model of humans is not overconfident of its own accuracy (as a model of humans). You'd hope to get compounding uncertainty rather than compounding errors.

Comment by Joe_Collman on Covid 9/2: Long Covid Analysis · 2021-09-02T16:19:53.762Z · LW · GW

Thanks again.
Missing link?: "This post is mainly about the question of whether Biden..." (which post?)

Comment by Joe_Collman on Covid 8/26: Full Vaccine Approval · 2021-08-26T21:51:26.669Z · LW · GW

I assume that most people are imagining eradication plans using [something close to current tools], and see solving the required coordination problems as a non-starter in that context.

I think that for eradication to be a realistic prospect you'd need a plan which a country could implement unilaterally with significant chance of long-term success. That seems to require new tools. I don't have a good sense of the odds of finding such tools, or the costs involved (either in research or implementation).

It would still be nice to see some informed discussion exploring the possibilities.

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-20T15:33:16.309Z · LW · GW

Interesting, thanks.

However, I don't think this is quite right (unless I'm missing something):

Now observe that in the LCDT planning world model  constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game theoretical outcome will be that the optimal policy is for the agent to kicks right, so it plays the opposite move that the goalkeeper expects. I'd argue that this is a form of deception, a deceptive scenario that LCDT is trying to prevent.

I don't think the situation is significantly different between B and C here. In B, the agent will decide to kick left most of the time since that's the Nash equilibrium. In C the agent will also decide to kick left most of the time: knowing the goalkeeper's likely action still leaves the same Nash solution (based on knowing both that the keeper will probably go left, and that left is the agent's stronger side).
If the agent knew the keeper would definitely go left, then of course it'd kick right - but I don't think that's the situation.
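
To make the Nash point concrete, here is a minimal sketch of a two-option penalty kick as a zero-sum game. The scoring probabilities are entirely hypothetical (they are not from the original example); they only encode the assumption that left is the kicker's stronger side. The names `SCORE`, `mixed_equilibrium` and `best_response` are illustrative.

```python
from fractions import Fraction as F

# Hypothetical scoring probabilities, indexed by (kick side, keeper dive side).
# Assumption: the kicker is stronger on the left.
SCORE = {
    ("L", "L"): F(60, 100),
    ("L", "R"): F(95, 100),
    ("R", "L"): F(90, 100),
    ("R", "R"): F(30, 100),
}

def mixed_equilibrium(score):
    """Mixed equilibrium of a 2x2 zero-sum game with no pure saddle point.

    Returns (P(kicker kicks left), P(keeper dives left)).
    """
    a, b = score[("L", "L")], score[("L", "R")]
    c, d = score[("R", "L")], score[("R", "R")]
    # The kicker mixes so that the keeper is indifferent between diving L and R.
    kick_left = (c - d) / ((b - a) + (c - d))
    # The keeper mixes so that the kicker is indifferent between kicking L and R.
    dive_left = (b - d) / ((c - a) + (b - d))
    return kick_left, dive_left

def best_response(score, dive_left):
    """Kicker's best pure response to a known keeper dive-left probability."""
    expected = {
        kick: dive_left * score[(kick, "L")] + (1 - dive_left) * score[(kick, "R")]
        for kick in ("L", "R")
    }
    return max(expected, key=expected.get)

kick_left, dive_left = mixed_equilibrium(SCORE)
```

Under these assumed numbers both players favour left at equilibrium, and the kicker only switches to a pure right kick when the keeper is certain to dive left (`best_response(SCORE, F(1))`).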

I'd be interested in your take on Evan's comment on incoherence in LCDT. Specifically, do you think the issue I'm pointing at is a difference between LCDT and counterfactual planners? (or perhaps that I'm just wrong about the incoherence??)
As I currently understand things, I believe that CPs are doing planning in a counterfactual-but-coherent world, whereas LCDT is planning in an (intentionally) incoherent world - but I might be wrong in either case.

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-14T00:29:12.036Z · LW · GW

Ok, that mostly makes sense to me. I do think that there are still serious issues (but these may be due to my remaining confusions about the setup: I'm still largely reasoning about it "from outside", since it feels like it's trying to do the impossible).

For instance:

1. I agree that the objective of simulating an agent isn't a problem. I'm just not seeing how that objective can be achieved without the simulation taken as a whole qualifying as an agent. Am I missing some obvious distinction here?
If for all x in X, sim_A(x) = A(x), then if A is behaviourally an agent over X, sim_A seems to be also. (Replacing equality with approximate equality doesn't seem to change the situation much in principle)
[Pre-edit: Or is the idea that we're usually only concerned with simulating some subset of the agent's input->output mapping, and that a restriction of some function may have different properties from the original function? (agenthood being such a property)]
    1. I can see that it may be possible to represent such a simulation as a group of nodes none of which is individually agentic - but presumably the same could be done with a human. It can't be ok for LCDT to influence agents based on having represented them as collections of individually non-agentic components.
    2. Even if sim_A is constructed as a Chinese room (w.r.t. agenthood), it's behaving collectively as an agent.
2. "it's just that LCDT will try to simulate that agent exclusively via non-agentic means" - mostly agreed, and agreed that this would be a good thing (to the extent possible).
However, I do think there's a significant difference between e.g.:
[LCDT will not aim to instantiate agents] (true)
vs
[LCDT will not instantiate agents] (potentially false: they may be side-effects)

Side-effect-agents seem plausible if e.g.:
a) The LCDT agent applies adjustments over collections within its simulation.
b) An adjustment taking [useful non-agent] to [more useful non-agent] also sometimes takes [useful non-agent] to [agent].

Here it seems important that LCDT may reason poorly if it believes that it might create an agent. I agree that pre-decision-time processing should conclude that LCDT won't aim to create an agent. I don't think it will conclude that it won't create an agent.
3. Agreed that finite factored sets seem promising to address any issues that are essentially artefacts of representations. However, the above seem more fundamental, unless I'm missing something.

Assuming this is actually a problem, it struck me that it may be worth thinking about a condition vaguely like:

• An LCDT_n agent cuts links at decision time to every agent other than [LCDT_m agents where m > n].

The idea being to specify a weaker condition that does enough forwarding-the-guarantee to allow safe instantiation of particular types of agent while still avoiding deception.

I'm far from clear that anything along these lines would help: it probably doesn't work, and it doesn't seem to solve the side-effect-agent problem anyway: [complete indifference to influence on X] and [robustly avoiding creation of X] seem fundamentally incompatible.

Thoughts welcome. With luck I'm still confused.

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-10T23:12:37.547Z · LW · GW

Ah yes, you're right there - my mistake.

However, I still don't see how LCDT can make good decisions over adjustments to its simulation. That simulation must presumably eventually contain elements classed as agentic.
Then given any adjustment X which influences the simulation outcome both through agentic paths and non-agentic paths, the LCDT agent will ignore the influence [relative to the prior] through the agentic paths. Therefore it will usually be incorrect about what X is likely to accomplish.

It seems to me that you'll also have incoherence issues here: X can change things so that p(Y = 0) is 0.99 through a non-agentic path, whereas the agent assumes the equivalent of [p(Y = 0) is 0.5] through an agentic path.

I don't see how an LCDT agent can make efficient adjustments to its simulation when it won't be able to decide rationally on those adjustments in the presence of agentic elements (which again, I assume must exist to simulate HCH).

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-10T21:07:10.284Z · LW · GW

Ok thanks, I think I see a little more clearly where you're coming from now.
(it still feels potentially dangerous during training, but I'm not clear on that)

A further thought:

The core idea is that LCDT solves the hard problem of being able to put optimization power into simulating something efficiently in a safe way

Ok, so suppose for the moment that HCH is aligned, and that we're able to specify a sufficiently accurate HCH model. The hard part of the problem seems to be safe-and-efficient simulation of the output of that HCH model.
I'm not clear on how this part works: for most priors, it seems that the LCDT agent is going to assign significant probability to its creating agentic elements within its simulation. But by assumption, it doesn't think it can influence anything downstream of those (or the probability that they exist, I assume).

That seems to be the place where LCDT needs to do real work, and I don't currently see how it can do so efficiently. If there are agentic elements contributing to the simulation's output, then it won't think it can influence the output.
Avoiding agentic elements seems impossible almost by definition: if you can create an arbitrarily accurate HCH simulation without its qualifying as agentic, then your test-for-agents can't be sufficiently inclusive.

...but hopefully I'm still confused somewhere.

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-10T06:51:21.799Z · LW · GW

Right, as far as I can see, it achieves the won't-be-deceptive aim. My issue is in seeing how we find a model that will consistently do the right thing in training (given that it's using LCDT).

As I understand it, under LCDT an agent is going to trade an epsilon utility gain on non-agent-influencing-paths for an arbitrarily bad outcome on agent-influencing-paths (since by design it doesn't care about those). So it seems that it's going to behave unacceptably for almost all goals in almost all environments in which there can be negative side-effects on agents we care about.
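
A toy sketch of that trade, under loudly stated assumptions: each action's utility is split into a part that flows through non-agent paths and a part that flows through agent paths, and the LCDT evaluation holds the agent-path part at a fixed prior value. The action names and numbers are hypothetical.

```python
# Hypothetical actions. "non_agent" is utility via non-agent paths;
# "agent" is the real effect via paths that pass through other agents.
ACTIONS = {
    "do_nothing": {"non_agent": 0.0, "agent": 0.0},
    "tiny_gain_huge_harm": {"non_agent": 1e-6, "agent": -1e9},
}

def lcdt_value(effects, prior_agent_effect=0.0):
    """Value as an LCDT agent would score it (assumption: agent-path
    consequences are pinned to a prior value that ignores the action)."""
    return effects["non_agent"] + prior_agent_effect

def full_value(effects):
    """Value when the action's effects through agents are counted too."""
    return effects["non_agent"] + effects["agent"]

lcdt_choice = max(ACTIONS, key=lambda name: lcdt_value(ACTIONS[name]))
full_choice = max(ACTIONS, key=lambda name: full_value(ACTIONS[name]))
```

In this toy setup the LCDT scoring picks the epsilon gain however bad the agent-path term is, because that term never enters the comparison.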

We can use it to run simulations, but it seems to me that most problems (deception in particular) get moved to the simulation rather than solved.

Quite possibly I'm still missing something, but I don't currently see how the LCDT decisions do much useful work here (Am I wrong? Do you see LCDT decisions doing significant optimisation?).
I can picture its being a useful wrapper around a simulation, but it's not clear to me in what ways finding a non-deceptive (/benign) simulation is an easier problem than finding a non-deceptive (/benign) agent. (maybe side-channel attacks are harder??)

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-07T01:03:08.040Z · LW · GW

[Pre-emptive apologies for the stream-of-consciousness: I made the mistake of thinking while I wrote. Hopefully I ended up somewhere reasonable, but I make no promises]

simulating HCH or anything really doesn't require altering the action set of a human/agent

My point there wasn't that it requires it, but that it entails it. After any action by the LCDT agent, the distribution over future action sets of some agents will differ from those same distributions based on the prior (perhaps very slightly).

E.g. if I burn your kite, your actual action set doesn't involve kite-flying; your prior action set does. After I take the [burn kite] action, my prediction of [kite exists] doesn't have a reliable answer.

If I'm understanding correctly (and, as ever, I may not be), this is just to say that it'd come out differently based on the way you set up the pre-link-cutting causal diagram. If the original diagram effectively had [kite exists iff Adam could fly kite], then I'd think it'd still exist after [burn kite]; if the original had [kite exists iff Joe didn't burn kite] then I'd think that it wouldn't.

In the real world, those two setups should be logically equivalent. The link-cutting breaks the equivalence. Each version of the final diagram functions in its own terms, but the answer to [kite exists] becomes an artefact of the way we draw the initial diagram. (I think!)

In this sense, it's incoherent (so Evan's not claiming there's no bullet, but that he's biting it); it's just less clear that it matters that it's incoherent.

I still tend to think that it does matter - but I'm not yet sure whether it's just offending my delicate logical sensibilities, or if there's a real problem.

For instance, in my reply to Evan, I think the [delete yourself to free up memory] action probably looks good if there's e.g. an [available memory] node directly downstream of the [delete yourself...] action.
If instead the path goes [delete yourself...] --> [memory footprint of future self] --> [available memory], then deleting yourself isn't going to look useful, since [memory footprint...] shouldn't change.
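
The dependence on how the diagram is drawn can be sketched as a reachability check. This is a rough illustration, assuming that "cutting links" means the decision is believed to have no effect on, or through, any node marked as agentic; the node names and the `believed_influence` helper are hypothetical.

```python
from collections import deque

def believed_influence(edges, agent_nodes, decision):
    """Nodes the decision is believed to affect once agent nodes are treated
    as unaffected by it (assumed reading of LCDT link-cutting)."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, queue = set(), deque([decision])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            # Never propagate influence into, or through, an agent node.
            if child in agent_nodes or child in seen:
                continue
            seen.add(child)
            queue.append(child)
    return seen

AGENTS = {"future_self_footprint"}

# Diagram 1: [available memory] sits directly downstream of the action.
direct = [("delete_self", "available_memory")]

# Diagram 2: the same effect is routed through a node treated as agentic.
routed = [
    ("delete_self", "future_self_footprint"),
    ("future_self_footprint", "available_memory"),
]

direct_view = believed_influence(direct, AGENTS, "delete_self")
routed_view = believed_influence(routed, AGENTS, "delete_self")
```

With the first diagram the action appears to free memory; with the second it appears to do nothing, even though both diagrams are meant to describe the same world.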

Perhaps it'd work in general to construct the initial causal diagrams in this way:
You route maximal causality through agents, when there's any choice.
So you then tend to get [LCDT action] --> [Agent action-set-alteration] --> [Whatever can be deduced from action-set-alteration].

You couldn't do precisely this in general, since you'd need backwards-in-time causality - but I think you could do some equivalent. I.e. you'd put an [expected agent action set distribution] node immediately after the LCDT decision, treat that like an agent at decision time, and deduce values of intermediate nodes from that.

So in my kite example, let's say you'll only get to fly your kite (if it exists) two months from my decision, and there's a load of intermediate nodes.
But directly downstream of my [burn kite] action we put a [prediction of Adam's future action set] node. All of the causal implications of [burn kite] get routed through the action set prediction node.

Then at decision time the action-set prediction node gets treated as part of an agent, and there's no incoherence. (but I predict that my [burn kite] fails to burn your kite)

Anyway, quite possibly doing things this way would have a load of downsides (or perhaps it doesn't even work??), but it seems plausible to me.

My remaining worry is whether getting rid of the incoherence in this way is too limiting - since the LCDT agent gets left thinking its actions do almost nothing (given that many/most actions would be followed by nodes which negate their consequences relative to the prior).

[I'll think more about whether I'm claiming much/any of this impacts the simulation setup (beyond any self-deletion issues)]

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-06T23:15:11.264Z · LW · GW

Ah ok. Weird, but ok. Thanks.

Perhaps I'm now understanding correctly(??). An undesirable action that springs to mind: delete itself to free up disk space. Its future self is assumed to give the same output regardless of this action.
More generally, actions with arbitrarily bad side-effects on agents, to gain marginal utility. Does that make sense?

I need to think more about the rest.

[EDIT: see also my rambling reply to Adam re ways to avoid the incoherence. TLDR: I think placing a [predicted agent action set alterations] node directly after the LCDT decision node in the original causal diagram, deducing what can be deduced from that node, and treating it as an agent at decision-time might work. It leaves the LCDT agent predicting that many of its actions don't do much, but it does get rid of the incoherence (I think). Currently unclear whether this throws the baby out with the bathwater; I don't think it does anything about negative side-effects]

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-06T13:43:01.544Z · LW · GW

LCDT agents cannot believe in ANY influence of their actions on other agents.

And my point is simply that once this is true, they cannot (coherently) believe in any influence of their actions on the world (in most worlds).

In (any plausible model of) the real world, any action taken that has any consequences will influence the distribution over future action sets of other agents.

I.e. I'm saying that [plausible causal world model] & [influences no agents] => [influences nothing]

So the only ways I can see it 'working' are:
1) To agree it always influences nothing (I must believe that any action I take as an LCDT agent does precisely nothing).
or
2) To have an incoherent world model: one in which I can believe with 99% certainty that a kite no longer exists, and with 80% certainty that you're still flying that probably-non-existent kite.

So I don't see how an LCDT agent makes any reliable predictions.
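
The incoherence in option 2 can be stated as a one-line probability check. A minimal sketch, assuming only that flying the kite implies the kite exists, so P(flying) cannot exceed P(exists); the function name is illustrative.

```python
def is_coherent(p_kite_exists, p_flying_kite):
    """True if the two beliefs can come from a single joint distribution.

    Flying the kite implies the kite exists, so P(flying) <= P(exists).
    """
    return 0.0 <= p_flying_kite <= p_kite_exists <= 1.0

# 99% sure the kite no longer exists, 80% sure it is still being flown.
lcdt_beliefs_ok = is_coherent(p_kite_exists=1 - 0.99, p_flying_kite=0.8)

# A coherent pair for comparison: the kite probably survives and is probably flown.
ordinary_beliefs_ok = is_coherent(p_kite_exists=0.9, p_flying_kite=0.8)
```

No assignment of probabilities to real worlds gives the first pair of numbers, which is the sense of "incoherent" intended here.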

[EDIT: if you still think this isn't a problem, and that I'm confused somewhere (which I may be), then I think it'd be helpful if you could give an LCDT example where:
The LCDT agent has an action x which alters the action set of a human.
The LCDT agent draws coherent conclusions about the combined impact of x and its prediction of the human's action. (of course I'm not saying the conclusions should be rational - just that they shouldn't be nonsense)]

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-06T04:48:49.026Z · LW · GW

I need to think more about it...

Me too!

First we know which part of the causal model correspond to the human, which is not the case in the NN

This doesn't follow only from [we know X is an LCDT agent that's modeling a human] though, right? We could imagine some predicate/constraint/invariant that detects/enforces/maintains LCDTness without necessarily being transparent to humans.
I'll grant you it seems likely so long as we have the right kind of LCDT agent - but it's not clear to me that LCDTness itself is contributing much here.

The human will be modeled only by variables on this part of the causal graph, whereas it could be completely distributed over a NN

At first sight this seems at least mostly right - but I do need to think about it more. E.g. it seems plausible that most of the work of modeling a particular human H fairly accurately is in modeling [humans-in-general] and then feeding H's properties into that. The [humans-in-general] part may still be distributed.
I agree that this is helpful. However, I do think it's important not to assume things are so nicely spatially organised as they would be once you got down to a molecular level model.

a causal model seems to give way more information than a NN, because it encodes the causal relationship, whereas a NN could completely compute causal relationships in a weird and counterintuitive way

My intuitions are in the same direction as yours (I'm playing devil's advocate a bit here - shockingly :)). I just don't have principled reasons to think it actually ends up more informative.

I imagine learned causal models can be counter-intuitive too, and I think I'd expect this by default. I agree that it seems much cleaner so long as it's using a nice ontology with nice abstractions... - but is that likely? Would you guess it's easier to get the causal model to do things in a 'nice', 'natural' way than it would be for an NN? Quite possibly it would be.

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-06T03:35:15.579Z · LW · GW

Ok, so if I understand you correctly (and hopefully I don't!), you're saying that as an LCDT agent I believe my prior determines my prediction of:
1) The distribution over action spaces of the human.
2) The distribution over actions the human would take given any particular action space.

So in my kite example, let's say my prior has me burn your kite with 10% probability.
So I believe that you start out with:
0.9 chance of the action set [Move kite left] [Move kite right] [Angrily gesticulate]
0.1 chance of the action set [Angrily gesticulate]

In considering my [burn kite] option, I must believe that taking the action doesn't change your distribution over action sets - i.e. that after I do [burn kite] you still have a 0.9 chance of the action set [Move kite left] [Move kite right] [Angrily gesticulate]. So I must believe that [burn kite] does nothing.
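
A small sketch of that reading, using the numbers above. The alternative action name `say_lovely_day` and the function names are assumptions made for illustration; the marginalization follows the description that the human's action-set distribution is computed from my prior over my own actions, not from the action I actually consider.

```python
FULL_SET = ("move_kite_left", "move_kite_right", "angrily_gesticulate")
NO_KITE_SET = ("angrily_gesticulate",)

# My prior over my own actions: burn the kite with 10% probability.
PRIOR_OVER_MY_ACTIONS = {"say_lovely_day": 0.9, "burn_kite": 0.1}

# The human's action set as it would really follow from each of my actions.
ACTION_SET_GIVEN_MY_ACTION = {
    "say_lovely_day": FULL_SET,
    "burn_kite": NO_KITE_SET,
}

def lcdt_action_set_distribution(my_action):
    """LCDT-style prediction: marginalize over my prior, ignoring my_action."""
    dist = {}
    for action, prob in PRIOR_OVER_MY_ACTIONS.items():
        action_set = ACTION_SET_GIVEN_MY_ACTION[action]
        dist[action_set] = dist.get(action_set, 0.0) + prob
    return dist

def ordinary_action_set_distribution(my_action):
    """Ordinary causal prediction: condition on the action actually taken."""
    return {ACTION_SET_GIVEN_MY_ACTION[my_action]: 1.0}

after_burning_lcdt = lcdt_action_set_distribution("burn_kite")
after_burning_ordinary = ordinary_action_set_distribution("burn_kite")
```

The LCDT-style prediction is the same whichever action I pass in, so after [burn kite] it still gives 0.9 to the full kite-flying action set.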

Is that right so far, or am I missing something?

Similarly, I must believe that any action I can take that would change the distribution over action sets of any agent at any time in the future must also do nothing.
That doesn't seem to leave much (or rather it seems to leave nothing in most worlds).

To put it another way, I don't think the intuition works for action-set changes the way it does for decision-given-action-set changes. I can coherently assume that an agent ignores the consequences of my actions in its decision-given-an-action-set, since that only requires I assume something strange about its thinking. I cannot coherently assume that the agent has a distribution over action sets that it does not have: this requires a contradiction in my world model.

It's not clear to me how the simulator-of-agents approach helps, but I may just be confused.
Currently the only coherent LCDT agent I can make sense of is trivial.

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-04T20:22:26.084Z · LW · GW

Oh and I don't think "LCDT isn't not" isn't not what you meant.

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-04T19:58:53.195Z · LW · GW

My first reaction is that it fall in the category of "myopic defection" instead of deception.

Ok, yes - it does seem at least to be a somewhat different issue. I need to think about it more.

In the concrete example, as you say, you would reveal it to any overseer/observer because you don't think anything you do would impact them

Yes, though I think the better way to put this is that I wouldn't spend effort hiding it. It's not clear I'd actively choose to reveal it, since there's no incentive in either direction once I think I have no influence on your decision. (I do think this is ok, since it's the active efforts to deceive we're most worried about)

If we're talking about a literal LCDT agent (which is what I have in mind), then it would have a learned causal model of HCH good enough to predict what the final output is.

Sure, but the case I'm thinking about is where the LCDT agent itself is little more than a wrapper around an opaque implementation of HCH. I.e. the LCDT agent's causal model is essentially: [data] --> [Argmax HCH function] --> [action].

I assume this isn't what you're thinking of, but it's not clear to me what constraints we'd apply to get the kind of thing you are thinking of. E.g. if our causal model is allowed to represent an individual human as a black-box, then why not HCH as a black-box? If we're not allowing a human as a black-box, then how far must things be broken down into lower-level gears (at fine enough granularity I'm not sure a causal model is much clearer than a NN)?

Quite possibly there are sensible constraints we could apply to get an interpretable model. It's just not currently clear to me what kind of thing you're imagining - and I assume they'd come at some performance penalty.

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-04T19:31:55.789Z · LW · GW

The technical answer is that the LCDT agent computes its distribution over actions spaces for the human by marginalizing the human's current distribution with the LCDT agent distribution over its own action. The intuition is something like: "I believe that the human has already some model of which action I will take, and nothing I can do will change that".

I'm with Steve in being confused how this works in practice.

Let's say I'm an LCDT agent, and you're a human flying a kite.

My action set: [Say "lovely day, isn't it?"] [Burn your kite]
Your action set: [Move kite left] [Move kite right] [Angrily gesticulate]

Let's say I initially model you as having p = 1/3 of each option, based on your expectation of my actions.
Now I decide to burn your kite.
What should I imagine will happen? If I burn it, your kite pointers are dangling.
Do the [Move kite left] and [Move kite right] actions become NOOPs?
Do I assume that my [burn kite] action fails?

I'm clear on ways you could technically say I didn't influence the decision - but if I can predict I'll have a huge influence on the output of that decision, I'm not sure what that buys us. (and if I'm not permitted to infer any such influence, I think I just become a pure nihilist with no preference for any action over any other)

Comment by Joe_Collman on LCDT, A Myopic Decision Theory · 2021-08-04T15:31:50.142Z · LW · GW

Very interesting. Thanks for writing this up.

Two points, either/both of which may be confusions on my part:

1. What seems to be necessary is that the LCDT agent thinks its decisions have no influence on the impact of other agents' decisions, not simply on the decisions themselves (this relates to Steve's second point). For example, let's say you're deciding whether to press button A or button B, and I rewire them so that B now has A's consequences, and A B's. I now assume that my action hasn't influenced your decision, but it has influenced the consequences of your decision.
    1. The causal graph here has both of us influencing a [buttons] node: I rewire them and you choose which to press. I've cut my link to you, but not to [buttons]. More generally, I can deceive you arbitrarily simply by anticipating your action and applying a post-action-adaptor to it (like re-wiring the buttons).
        1. Perhaps the idea here is that I'd have no incentive to hide my interference with the buttons (since I assume it won't change which you press). That seems to work for many cases, and so will be detectable/fixable in training - but after you apply a feedback loop of this sort you'll be left with the action-adaptor-based deceptions which you don't notice.
2. With an LCDT-agent-simulates-Argmax-HCH setup, I'm not clear why "its computation should be fundamentally more understandable than just running a trained model that we searched for acting like HCH". I can buy that the LCDT agent needs to explicitly simulate, but what stops it simulating something equivalent to "a trained model that we searched for..."?
    1. It seems to me that to get the "...and extract many valuable insights about its behavior", there needs to be an assumption that Argmax-HCH is being simulated in a helpful/clear/transparent way. It's not clear to me why this is expected: wouldn't the same pressures that lead to a "trained model that we searched for acting like HCH" tending to be opaque also lead the simulation of Argmax-HCH to be opaque? Specifically, the LCDT agent only needs to run it, not understand it.
        1. Is the idea here that an LCDT agent has particular constraints over the kinds of agents it can model/simulate? That wasn't my impression.
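
The button example from point 1 can be sketched as follows. The human's press probabilities are hypothetical, and the sketch assumes the LCDT agent keeps them fixed while still modelling its own effect on the [buttons] node.

```python
# Hypothetical press probabilities for the human, held fixed (link to the human is cut).
HUMAN_POLICY = {"A": 0.8, "B": 0.2}

NORMAL_WIRING = {"A": "consequence_A", "B": "consequence_B"}
REWIRED = {"A": "consequence_B", "B": "consequence_A"}

def outcome_distribution(policy, wiring):
    """Distribution over consequences given a fixed press policy and a wiring."""
    dist = {}
    for press, prob in policy.items():
        consequence = wiring[press]
        dist[consequence] = dist.get(consequence, 0.0) + prob
    return dist

before = outcome_distribution(HUMAN_POLICY, NORMAL_WIRING)
after = outcome_distribution(HUMAN_POLICY, REWIRED)
```

The press policy is identical in both calls, yet the distribution over consequences is swapped: the influence runs entirely through the [buttons] node.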

But hopefully I'm missing something!

Comment by Joe_Collman on AXRP Episode 10 - AI’s Future and Impacts with Katja Grace · 2021-07-25T04:04:06.184Z · LW · GW

Very interesting, thanks.

A few points on examples from humans in capacity-to-succeed-through-deception (tricking in the transcript):

1. It's natural that we don't observe anyone successfully doing this, since success entails not being identified as deceptive. This could involve secrecy, but more likely things like charisma and leverage of existing biases.
    1. When making comparisons with very-smart-humans, I think it's important to consider very-smart-across-all-mental-dimensions-humans (including charisma etc).
2. It may be that people have paths to high utility (which may entail happiness, enlightenment, meaning, contentment... rather than world domination) that don't involve the risks of a deceptive strategy. If human utility were e.g. linear in material resources, things may look different.
3. Human deception is often kept in check by cost-of-punishments outweighing benefit-of-potential-success. With AI agents the space of meaningful punishments will likely look different.

Comment by Joe_Collman on My Marriage Vows · 2021-07-23T01:06:24.811Z · LW · GW

Ah ok, if the honesty vow takes precedence. I still think it's a difficult one in edge cases, but I don't see effective resolutions that do better than using vows 2 and 3 to decide on those.

I'm not sure what's the difference between "set of vows" and "policy"?

The point isn't in choosing "set of vows" over "policy", but rather in choosing "I make the set of vows..." over "Everything I do will be according to...". You're able to make the set of vows (albeit implicitly), and the vows themselves will have the optimal amount of wiggle-room, achievability, flexibility, emphasis on good faith... built in.

To say "Everything I do will be according to..." seems to set the bar unachievably high, since it just won't be true. You can aim in that direction, but your actions won't even usually be optimal w.r.t. that policy. (thoughts on trying-to-try notwithstanding, I do think vows that are taken seriously should at least be realistically possible to achieve)

To put it another way, to get the "Everything I do..." formulation to be equivalent to the "I make the set of vows..." formulation, I think the former would need to be self-referential - i.e. something like "... according to the policy which is the KS solution... given its inclusion in this vow". That self-reference will insert the optimal degree of wiggle-room etc.

I think you need either the extra indirection or the self-reference (or I'm confused, which is always possible :)).

Comment by Joe_Collman on ($1000 bounty) How effective are marginal vaccine doses against the covid delta variant? · 2021-07-22T14:35:07.851Z · LW · GW

It would explain at least a slight efficiency increase: presumably [some collection of factors] (SCoF) influences whether there's a response or not. A priori you'd expect a smaller correlation of SCoF with SCoF-after-8-weeks than with SCoF-after-4-weeks. Presumably the actual impact is larger than this would predict (at least without a better model of SCoF).

Comment by Joe_Collman on ($1000 bounty) How effective are marginal vaccine doses against the covid delta variant? · 2021-07-22T02:31:51.962Z · LW · GW

In any context where good faith isn't to be expected (which I'd hope doesn't apply here), bear in mind that there are exploits.

Comment by Joe_Collman on Punishing the good · 2021-07-22T02:00:14.529Z · LW · GW

Agreed. It seems the right move should be to estimate the current net externalities (bearing in mind incentives to hide/publicise the negative/positive), and reward/punish in proportion to that.

Comment by Joe_Collman on My Marriage Vows · 2021-07-21T23:57:11.927Z · LW · GW

Very interesting - and congratulations!

A few thoughts:
It strikes me that the first vow will sometimes conflict with the second. If your idea is that any conflict with the second vow would be a (mild) information hazard, then ok - but I'm not sure what the first vow adds in this case.

Have you considered going meta?:
"I make the set of vows determined by the Kalai-Smorodinsky solution to the bargaining problem..."
"...I expect these vows to be something like [original vows go here] but the former description is definitive."

This has the nice upside of automatically catching problems you haven't considered, but not requiring you to be super-human. Specifically, the "Everything I do will be according to the policy..." clause just isn't achievable. Committing to the set of vows such a policy would have you make is achievable (you might not follow them perfectly, but there'd automatically be a balance between achievability and other desiderata).

Comment by Joe_Collman on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-19T19:15:42.741Z · LW · GW

Having thought about it more (hopefully with more clarity), I think I have trouble imagining training data for  that:

• We're highly confident is correct.
• Enables the model to decide which true things to output in general. (my (2) here)

It seems to me that we can be highly confident about matters of fact (how many chairs are in this room...), but less confident once value judgements come into play (which of A or B is the better answer to "How should I go about designing a chair?").
[Of course it's not black-and-white: one can make a philosophical argument that all questions are values questions. However, I think this is an issue even if we stick to pragmatic, common-sense approaches.]

I don't think we can remedy this for values questions by including only data that we're certain of. It seems to me that works for facts questions due to the structure of the world: it's so hugely constrained by physical law that you can get an extremely good model by generalizing from sparse data from a different distribution.

It's not clear that anything analogous works for generalizing preferences (maybe?? but I'd guess not). I'd expect an  trained on [data we're highly confident is correct] to generalize poorly to general open questions.

Similarly, in Paul's setup I think the following condition will fail if we need to be highly confident of the correctness (relative to what is known) of the small dataset:

• The small dataset is still rich enough that you could infer correct language usage from it, i.e. the consistency condition on the small dataset alone suffices to recover all 10,000 bits required to specify the intended model.

It's entirely plausible you can learn "correct language usage" in the narrow sense from consistency on the small dataset (i.e. you may infer a [deduced_statement -> natural_language_equivalent] mapping). I don't think it's plausible you learn it in the sense required (i.e. a [(set_of_all_deduced_statements, Q) -> natural_language_answer] mapping).

Again, perhaps I'm (not even) wrong, but I think the above accurately describes my current thinking.

Comment by Joe_Collman on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-18T03:43:12.467Z · LW · GW

Ok, the softer constraints make sense to me, thanks.

Using a debate with  assessing simple closed questions makes sense, but it seems to me that only moves much of the problem rather than solving it. We start with "answering honestly vs predicting human answers" and end up with "judging honestly vs predicting human judgments".

While "Which answer is better, Alice's or Bob's?" is a closed question, learning to answer the general case still requires applying a full model of human values - so it seems a judge-model is likely to be instrumental (or essentially equivalent: again, I'm not really sure what we'd mean by an intended model for the judge).

But perhaps I'm missing something here; is predicting-the-judge less of a problem than the original? Are there better approaches than using debate which wouldn't have analogous issues?

Comment by Joe_Collman on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-18T03:14:23.945Z · LW · GW

Ok, I think that makes some sense in so far as you're softening the  constraint and training it in more open-ended conditions. I'm not currently clear where this gets us, but I'll say more about that in my response to Paul.

However, I don't see how you can use generalization from the kind of dataset where  and  always agree (having asked prescriptive questions). [EDIT: now I do, I was just thinking particularly badly]
I see honestly answering a question as a 2-step process (conceptually):
1) Decide which things are true.
2) Decide which true thing to output.

In the narrow case, we're specifying ((2) | (1)) in the question, and training the model to do (1). Even if we learn a model that does (1) perfectly (in the intended way), it hasn't learned anything that can generalize to (2).
Step (2) is in part a function of human values, so we'd need to be giving it some human-values training signal for it to generalize.

[EDIT: I've just realized that I'm being very foolish here. The above suggests that learning (1) doesn't necessarily generalize to (2). In no way does it imply that it can't. I think the point I want to make is that an   that does generalize extremely well in this way is likely to be doing some close equivalent to predicting-the-human. (in this I'm implicitly claiming that doing (2) well in general requires full understanding of human values)]

Overall, I'm still unsure how to describe what we want: clearly we don't trust Alice's answers if she's being blackmailed, but how about if she's afraid, mildly anxious, unusually optimistic, slightly distracted, thinking about concept a or b or c...?
It's clear that the instrumental model just gives whatever response Alice would give here.
I don't know what the intended model should do; I don't know what "honest answer" we're looking for.

Suppose the situation has property x, and Alice has reacted with unusual-for-Alice property y. Do we want the Alice-with-y answer, or the standard-Alice answer? It seems to depend on whether we decide y is acceptable (or even required) w.r.t. answer reliability, given x. Then I think we get the same problem on that question etc.

Comment by Joe_Collman on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-14T22:01:30.607Z · LW · GW

Thanks for writing this up. It is useful to see a non-Paul perspective on the same ideas, both in terms of clarifying the approach, and eliminating a few of my confusions.

A typo: After "or defined in my notation as", you have  twice rather than

I've not yet been through the details, but it'd be helpful if you'd clarify the starting point and scope a little, since I may well be misunderstanding you (and indeed Paul). In particular on this:

Specifically,  is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding  as a logical statement and unembedding its answer in .

My immediate thought is that in general question answering there is no unique honest unembedding. Much of answer formation is in deciding which information is most relevant, important, useful, tacitly assumed... (even assuming fixed world model and fixed logical deductions).
So I assume that you have to mean a narrower context where e.g. the question specifies the logical form the answer must take and the answering human/model assigns values to pre-defined variables.

For a narrower setting, the gist of the post makes sense to me - but I don't currently see how a solution there would address the more general problem. Is finding a prior that works for closed questions with unique honest answers sufficient?

The more general setting seems difficult as soon as you're asking open questions.
If you do apply the  constraint there, then it seems   must do hugely more than a simple unembedding from deductions. It'll need to robustly select the same answer as a human from a huge set of honest answers, which seems to require something equivalent to predicting the human. At that point it's not clear to me when exactly we'd want   to differ from  in its later answers (there exist clear cases; I don't see a good general rule, or how you'd form a robust dataset to learn a rule).
To put it another way, [honest output to q from fixed world model] doesn't in general uniquely define an answer until you know what the answerer believes the asker of q values.

Apologies if I'm stating the obvious: I'm probably confused somewhere, and wish to double-check my 'obvious' assumptions. Clarifications welcome.

Comment by Joe_Collman on Teaching ML to answer questions honestly instead of predicting human answers · 2021-06-01T22:03:21.113Z · LW · GW

Ok, that all makes sense, thanks.

...and is-correct basically just tests whether they are equal.

So here "equal" would presumably be "essentially equal in the judgement of the complex process", rather than verbatim equality of labels (the latter seems silly to me; if it's not silly I must be missing something).

Comment by Joe_Collman on Teaching ML to answer questions honestly instead of predicting human answers · 2021-05-28T21:40:07.919Z · LW · GW

Very interesting, thanks.

Just to check I'm understanding correctly, in step 2, do you imagine the complex labelling process deferring to the simple process iff the simple process is correct (according to the complex process)? Assuming that we require precise agreement, something of that kind seems necessary to me.

I.e. the labelling process would be doing something like this:

# Return a pair of (simple, complex) labels for a given input
def label_pair(input):
    simple_label = GenerateSimpleLabel(input)

    # Defer to the simple label iff the complex process judges it correct
    if is_correct(simple_label, input):
        return simple_label, simple_label
    else:
        return simple_label, GenerateComplexLabel(input)

Does that make sense?

A couple of typos:
"...we are only worried if the model [understands? knows?] the dynamics..."
"...don’t collect training data in situations without [where?] strong adversaries are trying..."

Comment by Joe_Collman on AMA: Paul Christiano, alignment researcher · 2021-05-03T05:14:43.741Z · LW · GW

Thanks, that's very helpful. It still feels to me like there's a significant issue here, but I need to think more. At present I'm too confused to get much beyond handwaving.

A few immediate thoughts (mainly for clarification; not sure anything here merits response):

• I had been thinking too much lately of [isolated human] rather than [human process].
• I agree the issue I want to point to isn't precisely OOD generalisation. Rather it's that the training data won't be representative of the thing you'd like the system to learn: you want to convey X, and you actually convey [output of human process aiming to convey X]. I'm worried not about bias in the communication of X, but about properties of the generating process that can be inferred from the patterns of that bias.
• It does seem hard to ensure you don't end up OOD in a significant sense. E.g. if the content of a post-deployment question can sometimes be used to infer information about the questioner's resource levels or motives.
• The opportunity costs I was thinking about were in altruistic terms: where H has huge computational resources, or the questioner has huge resources to act in the world, [the most beneficial information H can provide] would often be better for the world than [good direct answer to the question]. More [persuasion by ML] than [extortion by ML].
• If (part of) H would ever ideally like to use resources to output [beneficial information], but gives direct answers in order not to get thrown off the project, then (part of) H is deceptively aligned. Learning from a (partially) deceptively aligned process seems unsafe.
• W.r.t. H's making value calls, my worry isn't that they're asked to make value calls, but that every decision is an implicit value call (when you can respond with free text, at least).

I'm going to try writing up the core of my worry in more precise terms.
It's still very possible that any non-trivial substance evaporates under closer scrutiny.

Comment by Joe_Collman on AMA: Paul Christiano, alignment researcher · 2021-04-29T04:45:58.598Z · LW · GW

I'd be interested in your thoughts on human motivation in HCH and amplification schemes.
Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?

Specifically, it concerns me that every H will have preferences valued more highly than [completing whatever task we assign], so would be expected to optimise its output for its own values rather than the assigned task, where these objectives diverged. In general, output needn't relate to the question/task.
[I don't think you've addressed this at all recently - I've only come across specifying enlightened judgement precisely]

I'd appreciate it if you could say if/where you disagree with the following kind of argument.
I'd like to know what I'm missing:

Motivation seems like an eventual issue for imitative amplification. Even for an H who always attempted to give good direct answers to questions in training, the best models at predicting H's output would account for differing levels of enthusiasm, focus, effort, frustration... based in part on H's attitude to the question and the opportunity cost in answering it directly.

The 'correct' (w.r.t. alignment preservation) generalisation must presumably be in all circumstances to give the output that H would give. In scenarios where H wouldn't directly answer the question (e.g. because H believed the value of answering the question were trivial relative to opportunity cost), this might include deception, power-seeking etc. Usually I'd expect high value true-and-useful information unrelated to the question; deception-for-our-own-good just can't be ruled out.

If a system doesn't always adapt to give the output H would, on what basis do we trust it to adapt in ways we would endorse? It's unclear to me how we avoid throwing the baby out with the bathwater here.

Or would you expect to find Hs for whom such scenarios wouldn't occur? This seems unlikely to me: opportunity cost would scale with capability, and I'd predict every H would have their price (generally I'm more confident of this for precisely the kinds of H I'd want amplified: rational, altruistic...).

If we can't find such Hs, doesn't this at least present a problem for detecting training issues?: if HCH may avoid direct answers or deceive you (for worthy-according-to-H reasons), then an IDA of that H eventually would too. At that point you'd need to distinguish [benign non-question-related information] and [benevolent deception] from [malign obfuscation/deception], which seems hard (though perhaps no harder than achieving existing oversight desiderata???).

Even assuming that succeeds, you wouldn't end up with a general-purpose question-answerer or task-solver: you'd get an agent that does whatever an amplified [model predicting H-diligently-answering-training-questions] thinks is best. This doesn't seem competitive across enough contexts.

...but hopefully I'm missing something.

Comment by Joe_Collman on Auctioning Off the Top Slot in Your Reading List · 2021-04-15T00:26:41.511Z · LW · GW

That's a good point, though I do still think you need the right motivation. Where you're convinced you're right, it's very easy to skim past passages that are 'obviously' incorrect, and fail to question assumptions.
(More generally, I do wonder what's a good heuristic for this - clearly it's not practical to constantly go back to first principles on everything; I'm not sure how to distinguish [this person is applying a poor heuristic] from [this person is applying a good heuristic to very different initial beliefs])

Perhaps the best would be a combination: a conversation which hopefully leaves you with the thought that you might be wrong, followed by the book to allow you to go into things on your own time without so much worry over losing face or winning.

Another point on the cause-for-optimism side is that being earnestly interested in knowing the truth is a big first step, and I think that description fits everyone mentioned so far.

Comment by Joe_Collman on Auctioning Off the Top Slot in Your Reading List · 2021-04-14T22:11:26.851Z · LW · GW

I'd guess that reciprocal exchanges might work better for friends:
I'll read any m books you pick, so long as you read the n books I pick.

Less likely to get financial ick-factor, and it's always possible that you'll gain from reading the books they recommend.

Perhaps this could scale to public intellectuals where there's either a feeling of trust or some verification mechanism (e.g. if the intellectual wants more people to read [some neglected X], and would willingly trade their time reading Y for a world where X were more widely appreciated).

Whether or not money is involved, I'm sceptical of the likely results for public intellectuals - or in general for people strongly attached to some viewpoint. The usual result seems to be a failure to engage with the relevant points. (perhaps not attacking head-on is the best approach: e.g. the asymmetrical weapons post might be a good place to start for Deutsch/Pinker)

Specifically, I'm thinking of David Deutsch speaking about AGI risk with Sam Harris: he just ends up telling a story where things go ok (or no worse than with humans), and the implicit argument is something like "I can imagine things going ok, and people have been incorrectly worried about things before, so this will probably be fine too". Certainly Sam's not the greatest technical advocate on the AGI risk side, but "I can imagine things going ok..." is a pretty general strategy.

The same goes for Steven Pinker, who spends nearly two hours with Stuart Russell on the FLI podcast, and seems to avoid actually thinking in favour of simply repeating the things he already believes. There's quite a bit of [I can imagine things going ok...], [People have been wrong about downsides in the past...], and [here's an argument against your trivial example], but no engagement with the more general points behind the trivial example.

Steven Pinker has more than enough intelligence to engage properly and re-think things, but he just pattern-matches any AI risk argument to [some scary argument that the future will be worse] and short-circuits to enlightenment-now cached thoughts. (to be fair to Steven, I imagine doing a book tour will tend to set related cached thoughts in stone, so this is a particularly hard case... but you'd hope someone who focuses on the way the brain works would realise this danger and adjust)

When you're up against this kind of pattern-matching, I don't think even the ideal book is likely to do much good. If two hours with Stuart Russell doesn't work, it's hard to see what would.

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-04-13T17:39:44.412Z · LW · GW

Unless I've confused myself badly (always possible!), I think either's fine here. The | version just takes out a factor that'll be common to all hypotheses: [p(e+) / p(e-)]. (since p(Tk & e+) ≡ p(Tk | e+) * p(e+))

Since we'll renormalise, common factors don't matter. Using the | version felt right to me at the time, but whatever allows clearer thinking is the way forward.
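
To spell out why the two versions agree after renormalisation, here is the identity written out. This is only a sketch in the notation of this thread, with $T_k$ standing for Tk and $e^{+}$, $e^{-}$ for e+ and e-:

```latex
\frac{p(T_k \wedge e^{+})}{p(T_k \wedge e^{-})}
  = \frac{p(T_k \mid e^{+})\, p(e^{+})}{p(T_k \mid e^{-})\, p(e^{-})}
  = \frac{p(T_k \mid e^{+})}{p(T_k \mid e^{-})} \cdot \frac{p(e^{+})}{p(e^{-})}
```

The final factor does not depend on k, so it drops out once the hypotheses are renormalised.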

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-04-12T19:59:25.802Z · LW · GW

Taking your last point first: I entirely agree on that. Most of my other points were based on the implicit assumption that readers of your post don't think something like "It's directly clear that 9 OOM will almost certainly be enough, by a similar argument".

Certainly if they do conclude anything like that, then it's going to massively drop their odds on 9-12 too. However, I'd still make an argument of a similar form: for some people, I expect that argument may well increase the 5-8 range more (than proportionately) than the 1-4 range.

On (1), I agree that the same goes for pretty-much any argument: that's why it's important. If you update without factoring in (some approximation of) your best judgement of the evidence's impact on all hypotheses, you're going to get the wrong answer. This will depend highly on your underlying model.

On the information content of the post, I'd say it's something like "12 OOMs is probably enough (without things needing to scale surprisingly well)". My credence for low OOM values is mostly based on worlds where things scale surprisingly well.

But this is a bit weird; my post didn't talk about the <7 range at all, so why would it disproportionately rule out stuff in that range?

I don't think this is weird. What matters isn't what the post talks about directly - it's the impact of the evidence provided on the various hypotheses. There's nothing inherently weird about evidence increasing our credence in [TAI by +10OOM] and leaving our credence in [TAI by +3OOM] almost unaltered (quite plausibly because it's not too relevant to the +3OOM case).

Compare the 1-2-3 coins example: learning y tells you nothing about the value of x. It's only ruling out any part of the 1 outcome in the sense that it maintains [x_heads & something independent is heads], and rules out [x_heads & something independent is tails]. It doesn't need to talk about x to do this.

You can do the same thing with the TAI first at k OOM case - call that Tk. Let's say that your post is our evidence e and that e+ stands for [e gives a compelling argument against T13+].
Updating on e+ you get the following for each k:
Initial hypotheses: [Tk & e+], [Tk & e-]
Final hypothesis: [Tk & e+]

So what ends up mattering is the ratio p(Tk | e+) : p(Tk | e-)
I'm claiming that this ratio is likely to vary with k.
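
A minimal sketch of what "varies with k" does to the update, written as an ordinary Bayes update on e+. Everything numeric here is made up for illustration (the three coarse buckets, the prior, and both likelihood tables are assumptions, not estimates from the post), and `bayes_update` is a hypothetical helper name:

```python
def bayes_update(prior, likelihood):
    """Posterior p(Tk | e+) from a prior p(Tk) and a likelihood p(e+ | Tk)."""
    unnormalised = {k: prior[k] * likelihood[k] for k in prior}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}


# Illustrative prior over three coarse buckets (assumed numbers).
prior = {"T1-4": 0.2, "T5-12": 0.5, "T13+": 0.3}

# Case 1: e+ is equally likely under every surviving hypothesis.
# This reproduces plain renormalisation.
flat_likelihood = {"T1-4": 0.5, "T5-12": 0.5, "T13+": 0.0}

# Case 2: e+ is somewhat more likely under the mid-range hypotheses,
# so they absorb more than their proportionate share of the mass.
varying_likelihood = {"T1-4": 0.5, "T5-12": 0.7, "T13+": 0.0}

flat_posterior = bayes_update(prior, flat_likelihood)
varying_posterior = bayes_update(prior, varying_likelihood)

for name, posterior in [("flat", flat_posterior), ("varying", varying_posterior)]:
    print(name, {k: round(v, 3) for k, v in posterior.items()})
```

With a flat likelihood the low bucket gets its full proportionate boost; with the varying one it still rises, but by less than renormalisation would give.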

Specifically, I'd expect T1 to be almost precisely independent of e+, while I'd expect T8 to be correlated. My reason on the T1 is that I think something radically unexpected would need to occur for T1 to hold, and your post just doesn't seem to give any evidence for/against that.
I expect most people would change their T8 credence on seeing the post and accepting its arguments (if they've not thought similar things before). The direction would depend on whether they thought the post's arguments could apply equally well to ~8 OOM as 12.

Note that I am assuming the argument ruling out 13+ OOM is as in the post (or similar).
If it could take any form, then it could be a more or less direct argument for T1.

Overall, I'd expect most people who agree with the post's argument to update along the following lines (but smoothly):
T0 to Ta: low increase in credence
Ta to Tb: higher increase in credence
Tb+: reduced credence

with something like (0 < a < 6) and (4 < b < 13).
I'm pretty sure a is going to be non-zero for many people.

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-04-04T21:36:43.909Z · LW · GW

[[ETA, I'm not claiming the >12 OOM mass must all go somewhere other than the <4 OOM case: this was a hypothetical example for the sake of simplicity. I was saying that if I had such a model (with zwomples or the like), then a perfectly good update could leave me with the same posterior credence on <4 OOM.
In fact my credence on <4 OOM was increased, but only very slightly]]

First I should clarify that the only point I'm really confident on here is the "In general, you can't just throw out the >12 OOM and re-normalise, without further assumptions" argument.

I'm making a weak claim: we're not in a position of complete ignorance w.r.t. the new evidence's impact on alternate hypotheses.

My confidence in any specific approach is much weaker: I know little relevant data.

That said, I think the main adjustment I'd make to your description is to add the possibility for sublinear scaling of compute requirements with current techniques. E.g. if beyond some threshold meta-learning efficiency benefits are linear in compute, and non-meta-learned capabilities would otherwise scale linearly, then capabilities could scale with the square of compute, i.e. compute requirements with the square root of capabilities (feel free to replace with a less silly example of your own).

This doesn't require "We'll soon get more ideas" - just a version of "current methods scale" with unlucky (from the safety perspective) synergies.

So while the "current methods scale" hypothesis isn't confined to 7-12 OOMs, the distribution does depend on how things scale: a higher proportion of the 1-6 region is composed of "current methods scale (very) sublinearly".

My p(>12 OOM | sublinear scaling) was already low, so my p(1-6 OOM | sublinear scaling) doesn't get much of a post-update boost (not much mass to re-assign).
My p(>12 OOM | (super)linear scaling) was higher, but my p(1-6 OOM | (super)linear scaling) was low, so there's not too much of a boost there either (small proportion of mass assigned).

I do think it makes sense to end up with a post-update credence that's somewhat higher than before for the 1-6 range - just not proportionately higher. I'm confident the right answer for the lower range lies somewhere between [just renormalise] and [don't adjust at all], but I'm not at all sure where.

Perhaps there really is a strong argument that the post-update picture should look almost exactly like immediate renormalisation. My main point is that this does require an argument: I don't think it's a situation where we can claim complete ignorance over impact to other hypotheses (and so renormalise by default), and I don't think there's a good positive argument for [all hypotheses will be impacted evenly].

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-03-31T19:57:30.019Z · LW · GW

Yes, we're always renormalising at the end - it amounts to saying "...and the new evidence will impact all remaining hypotheses evenly". That's fine once it's true.

I think perhaps I wasn't clear with what I mean by saying "This doesn't say anything...".
I meant that it may say nothing in absolute terms - i.e. that I may put the same probability of [TAI at 4 OOM] after seeing the evidence as before.

This means that it does say something relative to other not-ruled-out hypotheses: if I'm saying the new evidence rules out >12 OOM, and I'm also saying that this evidence should leave p([TAI at 4 OOM]) fixed, I'm implicitly claiming that the >12 OOM mass must all go somewhere other than the 4 OOM case.

Again, this can be thought of as my claiming e.g.:
[TAI at 4 OOM] will happen if and only if zwomples work
There's a 20% chance zwomples work
The new 12 OOM evidence says nothing at all about zwomples

In terms of what I actually think, my sense is that the 12 OOM arguments are most significant where [there are no high-impact synergistic/amplifying/combinatorial effects I haven't thought of].
My credence for [TAI at < 4 OOM] is largely based on such effects. Perhaps it's 80% based on some such effect having transformative impact, and 20% on we-just-do-straightforward-stuff. [Caveat: this is all just ottomh; I have NOT thought for long about this, nor looked at much evidence; I think my reasoning is sound, but specific numbers may be way off]

Since the 12 OOM arguments are of the form we-just-do-straightforward-stuff, they cause me to update the 20% component, not the 80%. So the bulk of any mass transferred from >12 OOM goes to cases where p([we-just-did-straightforward-stuff and no strange high-impact synergies occurred]|[TAI first occurred at this level]) is high.

Comment by Joe_Collman on Conspicuous saving · 2021-03-31T14:43:34.081Z · LW · GW

It's not entirely clear to me either.
Here are a few quick related thoughts:

1. We shouldn't assume it's clear that higher-long-term QoL is the primary motivator for most people who do save. For most of them, it's something their friends, family, co-workers... think is a good idea.
2. Evolutionary fitness doesn't care (directly) about QoL.
3. There may be unhelpful game theory at work. If, in a group where people tend to spend X, there's quite a bit to gain in spending [X + 1], and a significant loss in spending [X - 1], you'd expect group spending to increase.
4. Even if we're talking about [what's effective] rather than [our evolutionary programming], we're still navigating other people's evolutionary programming. Being slightly above/below average in spending may send a disproportionate signal.
5. The value of a faked signal is higher for people who don't have other channels to signal something similar.
6. Other groups likely are sending similar signals in other ways. E.g. consider intellectuals sitting around having lengthy philosophical discussions that don't lead to action. They're often wasting time, simultaneously showing off skills that they could be using more productively, but aren't. (this is also a problem where it's a genuine waste - my point is only that very few people avoid doing this in some form)

Of course none of this makes it any less of a problem (to the extent it's bringing down collective QoL) - but possibly a difficult problem that we'd expect to exist.

Solutions-wise, my main thought is that you'd want to find a way to channel signalling-waste efficiently into public goods - so that personal 'waste' becomes a collective advantage (hopefully).

It is also worth noting that not all 'wasteful' spending is bad for society. E.g. consider early adopters of new and expensive technology: without people willing to 'waste' money on the Tesla Roadster, getting electric cars off the ground may have been a much harder problem.

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-03-30T13:09:40.509Z · LW · GW

We do gain evidence on at least some alternatives, but not on all the factors which determine the alternatives. If we know something about those factors, we can't usually just renormalise. That's a good default, but it amounts to an assumption of ignorance.

Here's a simple example:
We play a 'game' where you observe the outcome of two fair coin tosses x and y.
You score:
1 if x is heads
2 if x is tails and y is heads
3 if x is tails and y is tails

So your score predictions start out at:
1 : 50%
2 : 25%
3 : 25%

We look at y and see that it's heads. This rules out 3.
Renormalising would get us:
1 : 66.7%
2 : 33.3%
3 : 0%

This is clearly silly, since we ought to end up at 50:50 - i.e. all the mass from 3 should go to 2. This happens because the evidence that falsified 3 points was insignificant to the question "did you score 1 point?".
On the other hand, if we knew nothing about the existence of x or y, and only knew that we were starting from (1: 50%, 2: 25%, 3: 25%), and that 3 had been ruled out, it'd make sense to re-normalise.
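
The coin game is small enough to check by enumeration. The sketch below assumes a score of 1 whenever x is heads (implied by the 50% starting prediction for a score of 1); `score` and `score_distribution` are names introduced just for this illustration:

```python
from fractions import Fraction
from itertools import product


def score(x, y):
    # Assumed rule: 1 if x is heads, otherwise 2 or 3 depending on y.
    if x == "H":
        return 1
    return 2 if y == "H" else 3


def score_distribution(outcomes):
    """Probability of each score when every listed (x, y) outcome is equally likely."""
    dist = {1: Fraction(0), 2: Fraction(0), 3: Fraction(0)}
    for x, y in outcomes:
        dist[score(x, y)] += Fraction(1, len(outcomes))
    return dist


all_outcomes = list(product("HT", repeat=2))
prior = score_distribution(all_outcomes)

# Proper update: condition on the actual observation, y is heads.
posterior = score_distribution([(x, y) for x, y in all_outcomes if y == "H"])

# Naive update: throw away score 3 and rescale whatever is left.
kept = {s: p for s, p in prior.items() if s != 3}
naive = {s: p / sum(kept.values()) for s, p in kept.items()}

print("prior    ", prior)
print("posterior", posterior)
print("naive    ", naive)
```

Conditioning on y moves all of the mass from 3 onto 2, whereas the naive rescaling hands two thirds of it to 1, which y says nothing about.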

In the TAI case, we haven't only learned that 12 OOM is probably enough (if we agree on that). Rather we've seen specific evidence that leads us to think 12 OOM is probably enough. The specifics of that evidence can lead us to think things like "This doesn't say anything about TAI at +4 OOM, since my prediction for +4 is based on orthogonal variables", or perhaps "This makes me near-certain that TAI will happen by +10 OOM, since the +12 OOM argument didn't require more than that".

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-03-29T18:26:56.634Z · LW · GW

If you have a bunch of hypotheses (e.g. "It'll take 1 more OOM," "It'll take 2 more OOMs," etc.) and you learn that some of them are false or unlikely (only 10% chance of it taking more than 12) then you should redistribute the mass over all your remaining hypotheses, preserving their relative strengths.

This depends on the mechanism by which you assigned the mass initially - in particular, whether it's absolute or relative. If you start out with specific absolute probability estimates as the strongest evidence for some hypotheses, then you can't just renormalise when you falsify others.

E.g. suppose we start out with these beliefs:
If [approach X] is viable, TAI will take at most 5 OOM; 20% chance [approach X] is viable.
If [approach X] isn't viable, 0.1% chance TAI will take at most 5 OOM.
30% chance TAI will take at least 13 OOM.

We now get this new information:
There's a 95% chance [approach Y] is viable; if [approach Y] is viable TAI will take at most 12 OOM.

We now need to reassign most of the 30% mass we have on >= 13 OOM, but we can't simply renormalise: we haven't (necessarily) gained any information on the viability of [approach X].
Our post-update [TAI <= 5 OOM] credence should remain almost exactly 20%. Increasing it to ~26% would not make any sense.
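
A rough numerical sketch of the same point in Python. It assumes the news about [approach Y] cuts the mass on 13+ OOM from 30% to 5%; that reading, and all the variable names, are my assumptions rather than anything fixed by the example. Under it, naive renormalisation lands at roughly 27%, in the same region as the rough figure above, while the structured estimate does not move.

```python
# Beliefs as stated in the example.
p_x_viable = 0.20
p_le5_given_x = 1.0          # if X is viable, TAI takes at most 5 OOM
p_le5_given_not_x = 0.001
p_ge13_before = 0.30

# Assumption: after the news about approach Y, at most 5% remains on >= 13 OOM.
p_ge13_after = 0.05

# Structured estimate: depends only on X, which the new information doesn't touch.
p_le5 = p_x_viable * p_le5_given_x + (1 - p_x_viable) * p_le5_given_not_x

# Naive renormalisation: scale every surviving hypothesis by the same factor.
naive_le5 = p_le5 * (1 - p_ge13_after) / (1 - p_ge13_before)

print(f"structured: {p_le5:.4f}, naive renormalisation: {naive_le5:.4f}")
```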

For AI timelines, we may well have some concrete, inside-view reasons to put absolute probabilities on contributing factors to short timelines (even without new breakthroughs we may put absolute numbers on statements of the form "[this kind of thing] scales/generalises"). These probabilities shouldn't necessarily be increased when we learn something giving evidence about other scenarios. (The probability of a short timeline should change, but in general not proportionately.)

Perhaps if you're getting most of your initial distribution from a more outside-view perspective, then you're right.

Comment by Joe_Collman on Conspicuous saving · 2021-03-21T22:17:05.519Z · LW · GW

Oh I'm not claiming that non-wasted wealth signalling is useless. I'm saying that frivolous spending and saving send very different signals - and that saving doesn't send the kind of signal The Elephant in the Brain (tEitB) focuses on.

Whether a public saving-signal would actually help is an empirical question. My guess is that it wouldn't help in most contexts where unwise spending is currently the norm, since I'd expect it to signal lack of ability/confidence. Of course I may be wrong.

When considering status, I think wealth is largely valued as an indirect signal of ability (in a broad sense). E.g. compare getting $100m by founding a business vs winning a lottery. The lottery winner gets nothing like the status bump that the business founder gets.
This is another reason I think spending sends a stronger overall signal in many contexts: it (usually) says both [I had the ability to get this wealth] and [I have enough ability that I expect not to need it].

Comment by Joe_Collman on Conspicuous saving · 2021-03-21T14:40:28.967Z · LW · GW

This is interesting, but I think largely misses the point that elephant-in-the-brain-style signalling is often about sending the signal "I can afford to waste resources, because I've got great confidence in my ability to do well in the future without them".

Saving just doesn't achieve this - it achieves the opposite:
"Look at all my savings; I can't afford to waste any resources, since I have little confidence in my ability to do well in the future without them".

It makes evolutionary sense to signal ability rather than resources, since resources can't be passed on (until very recently, at least), and won't necessarily translate to all situations. By wasting resources, you're signalling your confidence you'll do well whatever the future throws at you.

If you want a signalling approach that improves the world, I think it has to be conspicuous donation, not conspicuous saving.

Comment by Joe_Collman on Voting-like mechanisms which address size of preferences? · 2021-03-19T01:09:19.858Z · LW · GW

Very interesting - I'll give some thought to answers, but for now a quick cached-thought comment:

A proposed solution: bills cannot be contradicted by bills which pass with less votes.

I don't think this is practical as a full solution to this problem, since a bill doesn't need to explicitly contradict a previous bill in order to make the previous one irrelevant.

You've made foobles legal? We'll require fooble licenses costing two years' training and a million dollars.
You've banned smarbling? We'll switch all resources from anti-smarbling enforcement to crack down on unlicensed foobles.

Of course you could craft the fooble/smarbling laws to avoid these pitfalls, but there's more than one way to smarble a fooble.

Comment by Joe_Collman on Strong Evidence is Common · 2021-03-17T23:09:26.650Z · LW · GW

Sure, but what I mean is that this is hard to do for hypothesis-location, since post-update you still have the hypothesis-locating information, and there's some chance that your "explaining away" was itself incorrect (or your memory is bad, you have bugs in your code...).

For an extreme case, take Donald's example, where the initial prior would be 8,000,000 bits against.
Locating the hypothesis there gives you ~8,000,000 bits of evidence. The amount you get in an "explaining away" process is bounded by your confidence in the new evidence. How sure are you that you correctly observed and interpreted the "explaining away" evidence? Maybe you're 20 bits sure; perhaps 40 bits sure. You're not 8,000,000 bits sure.

Then let's say you've updated down quite a few times, but not yet close to the initial prior value. For the next update, how sure are you that the stored value that you'll be using as your new prior is correct? If you're human, perhaps you misremembered; if a computer system, perhaps there's a bug...
Below a certain point, the new probability you arrive at will be dominated by contributions from weird bugs, misrememberings etc.
This remains true until/unless you lose the information describing the hypothesis itself.
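
A toy Python sketch of the bound described above, under a deliberately crude model: with some small probability the "explaining away" evidence is itself mistaken, and in that case the hypothesis keeps some non-trivial probability. The 0.5 used for that case is an arbitrary assumption, as are the function names.

```python
import math


def posterior_floor(error_bits, p_h_if_mistaken):
    # Lower bound on P(H) after the debunking evidence: the branch where the
    # evidence is sound contributes a non-negative amount, so it is dropped.
    p_mistaken = 2.0 ** -error_bits
    return p_mistaken * p_h_if_mistaken


def bits_against(p):
    # Express a probability as "bits against" the hypothesis.
    return -math.log2(p)


for error_bits in (20, 40):
    # Assumption: if the debunking was a mistake, H is a coin flip.
    floor = posterior_floor(error_bits, p_h_if_mistaken=0.5)
    print(f"{error_bits} bits sure -> at most {bits_against(floor):.0f} bits against")
```

With these made-up inputs the hypothesis can't be pushed beyond 21 or 41 bits against, nowhere near the 8,000,000 of the original prior.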

I'm not clear how much this is a practical problem - I agree you can update the odds of a hypothesis down to no-longer-worthy-of-consideration. In general, I don't think you can get back to the original prior without making invalid assumptions (e.g. zero probability of a bug/hack/hoax...), or losing the information that picks out the hypothesis.

Comment by Joe_Collman on Strong Evidence is Common · 2021-03-16T22:30:15.026Z · LW · GW

It's worth noting that most of the strong evidence here is in locating the hypothesis.
That doesn't apply to the juggling example - but that's not so much evidence. "I can juggle" might take you from 1:100 to 10:1. Still quite a bit, but 10 bits isn't 24.
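
For reference, a short Python sketch of the arithmetic, treating evidence in bits as log2 of the factor by which the odds move. The function name is just illustrative.

```python
import math


def bits_of_evidence(prior_odds, posterior_odds):
    # Evidence in bits = log2 of the factor by which the odds moved.
    return math.log2(posterior_odds / prior_odds)


# Juggling example: odds go from 1:100 to 10:1, a factor of about 1000.
juggling_bits = bits_of_evidence(prior_odds=1 / 100, posterior_odds=10 / 1)

# For comparison, the odds factor that 24 bits corresponds to.
ratio_for_24_bits = 2 ** 24

print(f"juggling: {juggling_bits:.2f} bits; 24 bits = {ratio_for_24_bits}:1")
```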

I think this relates to Donald's point on the asymmetry between getting from exponentially small to likely (commonplace) vs getting from likely to exponentially sure (rare). Locating a hypothesis can get you the first, but not the second.

It's even hard to get back to exponentially small chance of x once it seems plausible (this amounts to becoming exponentially sure of ¬x). E.g., if I say "My name is Mortimer Q. Snodgrass... Only kidding, it's actually Joe Collman", what are the odds that my name is Mortimer Q. Snodgrass? 1% perhaps, but it's nowhere near as low as the initial prior.
The only simple way to get all the way back is to lose or throw away the hypothesis-locating information - which you can't do via a Bayesian update. I think that's what makes privileging the hypothesis such a costly error: in general you can't cleanly update your way back (if your evidence, memory and computation were perfectly reliable, you could - but they're not). The way to get back is to find the original error and throw it out.

How difficult is it to get into the top 1% of traders? To be 50% sure you're in the top 1%, you only need 200:1 evidence. This seemingly large odds ratio might be easy to get.

I don't think your examples say much about this. They're all of the form [trusted-in-context source] communicates [unlikely result]. They don't seem to show a reason to expect strong evidence may be easy to get when this pattern doesn't hold. (I suppose they do say that you should check for the pattern - and probably it is useful to occasionally be told "There may be low-hanging fruit. Look for it!")

Comment by Joe_Collman on Anna and Oliver discuss Children and X-Risk · 2021-03-15T21:47:01.849Z · LW · GW

I hope you find the time. I hadn't realised this was happening and would be interested in any thoughts and ideas. It's an issue that's high impact, broadly relevant, hard to get good data on, and easy to reason poorly about - so I'm glad to see some thoughtful discussion.

Comment by Joe_Collman on AstraZeneca COVID Vaccine and blood clots · 2021-03-15T15:19:03.631Z · LW · GW

Is there a source that shows there's even a correlation? Please link one if there is - perhaps I missed it. The reports I've seen don't suggest any - e.g. bmj report.

From what (little) I've seen, this seems to be evidence for the hypothesis "Anecdotes frequently cause officials with bad incentives to make harmful decisions".

Comment by Joe_Collman on Trapped Priors As A Basic Problem Of Rationality · 2021-03-14T00:46:32.860Z · LW · GW

I think it's important to distinguish between:

1) Rationality as truth-seeking.
2) Rationality as utility maximization.

For some of the examples these will go together. For others, moving closer to the truth may be a utility loss - e.g. for political zealots whose friends and colleagues tend to be political zealots.

It'd be interesting to see a comparison between such cases. At the least, you'd want to vary the following:

Having a very high prior on X's being true.
Having a strong desire to believe X is true.
Having a strong emotional response to X-style situations.
The expected loss/gain in incorrectly believing X to be true/false.

Cultists and zealots will often have a strong incentive to believe some X even if it's false, so it's not clear the high prior is doing most/much of the work there.

With trauma-based situations, it also seems particularly important to consider utilities: more to lose in incorrectly thinking things are safe, than in incorrectly thinking they're dangerous.
When you start out believing something's almost certainly very dangerous, you may be right. For a human, the utility-maximising move probably is to require more than the 'correct' amount of evidence to shift your belief (given that you're impulsive, foolish, impatient... and so can't necessarily be trusted to act in your own interests with an accurate assessment).

It's also worth noting that habituation can be irrational. If you're repeatedly in a situation where there's good reason to expect a 0.1% risk of death, but nothing bad happens the first 200 times, then you'll likely habituate to under-rate the risk - unless your awareness of the risk makes the experience of the situation appropriately horrifying each time.
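
A quick Python check of how little those 200 uneventful exposures should move you. The comparison hypothesis ("no risk at all") is my assumption, chosen as the most favourable alternative; the variable names are illustrative.

```python
import math

risk_per_exposure = 0.001  # 0.1% risk of death each time
safe_exposures = 200

# Probability of seeing 200 safe exposures if the 0.1% risk is real.
p_all_safe_if_risky = (1 - risk_per_exposure) ** safe_exposures

# Assumed alternative: "no risk at all", which predicts 200 safe exposures
# with probability 1. The likelihood ratio in its favour, in bits:
bits_for_no_risk = math.log2(1.0 / p_all_safe_if_risky)

print(f"P(200 safe | 0.1% risk) = {p_all_safe_if_risky:.3f}")
print(f"evidence for 'no risk' = {bits_for_no_risk:.2f} bits")
```

Roughly an 82% chance of seeing nothing bad even if the risk is real, so the run of safe outcomes is worth well under one bit.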

On polar bears vs coyotes:

I don't think it's reasonable to label the "...I saw a polar bear..." sensation as "evidence for bear". It's weak evidence for bear. It's stronger evidence for the beginning of a joke. For [polar bear] the [earnest report]:[joke] odds ratio is much lower than for [coyote].

I don't think you need to bring in any irrational bias to get this result. There's little shift in belief since it's very weak evidence.

If your friend never makes jokes, then the point may be reasonable. (In particular, for your friend to mistakenly earnestly believe she saw a polar bear, it's reasonable to assume that she already compensated for polar-bear unlikeliness; the same doesn't apply if she's telling a joke.)