Often the effect of being blinded is that you take suboptimal actions. As you pointed out in your example, if you see the problem then all sorts of cheap ways to reduce the harmful impact occur to you. So perhaps one way of getting to the issue could be to point at that: "I know you care about my feelings, and it wouldn't have made this meeting any less effective to have had it more privately, so I'm surprised that you didn't"?
Wireheading traps.
An agent is "wireheading" if it is taking an action that a) provides it with enormous amounts of utility for little effort, b) is trivial or seemingly unrelated to its "main" utility function or goals.
People have discussed the possibility of an AI wireheading as a problem for getting it to do what you want: "what if it just works out a way to set your metric to +ve infinity and then goes to sleep satisfied?"
But we can also use this as a guard-rail.
A "wireheading trap" is an action which a) is very hard for an AI to do below a level of capacity X, but very easy for it to do afterwards, b) provides it with vastly more utility than anything else it could do, and c) has little effect on the world otherwise.
An AI with a wireheading trap that is set to be doable if it attains world-affecting capabilities "fails closed", because if it becomes more capable it will effectively switch itself off by wireheading. An example of a trap could be something like "etch 'Bill is a poo poo head' in 10m high letters into Mt Everest". Very hard if you don't have the ability to affect the physical world, but if you have nanotech... why bother melting humanity when you can just deface Mt Everest and be done with it?
Obvious problems:
- Convergent instrumental goals. We don't want the AI to wipe us out in order to stop us from interfering with its wireheading, however unlikely that might be. It's unclear what to do about this; myopia is one possibility (once you've wireheaded you don't care if you're stopped later?), but it has problems.
- If you make the trap too attractive, the AI won't even try to be useful before it has enough capacity, instead wasting all its time on futile attempts to wirehead. It's unclear how to make it attractive enough that it dominates once the capability is there, but not before.
Overall very half-baked, but I wonder if there's something to be done in the general area of "have the AI behave in a way that neuters it, but only when its capabilities increase".
We have trained ML systems to play games, what if we trained one to play a simplified version of the "I'm an AI in human society" game?
Have a population of agents with preferences; the AI is given some poorly specified goal and the ability to expand its capabilities, etc. You might expect to observe things like a "treacherous turn".
If we could do that it would be quite the scary headline "Researchers simulate the future with AI and it kills us all". Not proof, but perhaps viral and persuasive.
I think I would argue that harm/care isn't obviously deontological. Many of the others are indeed about the performance of the action, but I think arguably harm/care is actually about the harm. There isn't an extra term for "and this was done by X".
That might just be me foisting my consequentialist intuitions on people, though.
"What if there's an arms race / race to the bottom in persuasiveness, and you have to pick up all the symmetrical weapons others use and then use asymmetrical weapons on top of those?"
Doesn't this question apply to other cases of symmetric/asymmetric weapons just as much?
I think the argument is that you want to try and avoid the arms race by getting everyone to agree to stick to symmetrical weapons because they believe it'll benefit them (because they're right). This may not work if they don't actually believe they're right and are just using persuasion as a tool, but I think it's something we could establish as a community norm in restricted circles at least.
The point that the Law needs to be simple and local so that humans can cope with it is also true of other domains. And this throws up an important constraint for people designing systems that humans are supposed to interact with: you must make it possible to reason simply and locally about them.
This comes up in programming (to a man with a nail everything looks like a hammer): good programming practice emphasises splitting programs up into small components that can be reasoned about in isolation. Modularity, compositionality, abstraction, and so on, aside from their other benefits, make it possible to reason about code locally.
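A throwaway sketch of the kind of decomposition I mean (Python, with made-up names; nothing here is from the original discussion):

```python
# Each piece does one job and can be checked in isolation; the reader
# never has to hold the whole program in their head at once.

def parse_record(line: str) -> tuple[str, float]:
    """Turn "name,amount" into a (name, amount) pair."""
    name, amount = line.split(",")
    return name.strip(), float(amount)

def total_by_name(records: list[tuple[str, float]]) -> dict[str, float]:
    """Sum amounts per name."""
    totals: dict[str, float] = {}
    for name, amount in records:
        totals[name] = totals.get(name, 0.0) + amount
    return totals

def report(lines: list[str]) -> dict[str, float]:
    """The composition is the only place the pieces meet."""
    return total_by_name([parse_record(line) for line in lines])

print(report(["alice,1.5", "bob,2.0", "alice,3.0"]))
# {'alice': 4.5, 'bob': 2.0}
```

Each function can be understood, tested, and replaced without looking at the others, which is exactly the "local reasoning" property.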
Of course, some people inexplicably believe that programs are mostly supposed to be consumed by computers, which have very different simplicity requirements and don't care much about locality. This can lead to programs that are very difficult for humans to consume.
Similarly, if you are writing a mathematical proof, it is good practice to try and split it up into small lemmas, transform the domain with definitions to make it simpler, and prove sub-components in isolation.
Interestingly, these days you can also write mathematical proofs to be consumed by a computer. And these often suffer some of the same problems that computer programs do - because what is simple for the computer does not necessarily correspond to what is simple for the human.
(Tendentious speculation: perhaps it is not a coincidence that mathematicians tend to gravitate towards functional programming.)
I am reminded of Guided by the Beauty of our Weapons. Specifically, it seems like we want to encourage forms of rhetoric that are disproportionately persuasive when deployed by someone who is in fact right.
Something like "make the structure of your argument clear" is probably good (since it will make bad arguments look bad), "use vivid examples" is unclear (can draw people's attention to the crux of your argument, or distract from it), "tone and posture" are probably bad (because the effect is symmetrical).
So a good test is "would this have an equal effect on the persuasiveness of my speech if I was making an invalid point?". If the answer is no, then do it; otherwise maybe not.
Yes, this is very annoying.
I found Kevin Simler's observation that an apology is a status-lowering move to be very helpful. In particular, it gives you a good way to tell if you made an apology properly: do you feel lower status?
I think that even if you take the advice in this post you can make non-apologies if you don't manage to make yourself lower your own status. Bits of the script that are therefore important:
- Being honest about the explanation, especially if it's embarrassing.
- Emphasising explanations that attribute agency to you - "I just didn't think about it" is bad for this reason.
- Not being too calm and clinical about the process - this suggests that it's unimportant.
This also means that weird dramatic stuff can be good if it actually makes you lower your status. If falling to your knees and embracing the person's legs will be perceived as lowering your status rather than as funny, then maybe that will help.
This is a great point. I think this can also lead to cognitive dissonance: if you can predict that doing X will give you a small chance of doing Y, then in some sense it's already in your choice set and you've got the regret. But if you can stick your fingers in your ears enough and pretend that X isn't possible, then that saves you from the regret.
Possible values of X: moving, starting a company, ending a relationship. Scary big decisions in general.
Something that confused me for a bit: people use regret-minimization to handle exploration-exploitation problems, so shouldn't they have noticed a bias against exploration? I think the answer here is that the "exploration" people usually think about involves taking an already known option to gain more information about it, not actually expanding the choice set. I don't know of any framework that includes actions that actually change the choice set.
I've read it shallowly, and I think it's generally good. I think I'll have some more comments after I've thought about it a bit more. I'm surprised either by the lack of previous quantitative models, or the lack of reference to them (which is unsurprising if they don't exist!). Is there really nothing prior to this?
I would dearly, dearly love to be able to use the fairly-standard Markdown footnote extension.
I think your example won't work, but it depends on the implementation of FHE. If there's a nonce involved (which there really should be), then you'll get different encrypted data for the output of the two programs you run, even though the underlying data is the same.
But you don't actually need to do that. The protocol lets B exfiltrate one bit of data, whatever bit they like. A doesn't get to validate the program that B runs, they can only validate the output. So any program that produces 0 or 1 will satisfy A and they'll even decrypt the output for you.
That does indeed mean that B can find out if A is blackmailable, or something, so exposing your source code is still risky. What would be really cool would be a way to let A also be sure what program has been run on their source by B, but I couldn't think of a way to do this such that both A and B are sure that the program was the one that actually got run.
I haven't read Age of Em, but something like "spur safes" was an inspiration (I'm sure I've come across the idea before). My version is similar except that
- It's stripped down.
- B only needs to make a Validator, which could be a copy of themself, but doesn't have to be.
- It only validates A to B, rather than trying to do both simultaneously. You can of course just run it twice in both directions.
- You don't need a trusted computing environment.
I think that's a pretty big deal, because the trusted computing environment has to be trusted enough to run its end of A/B's secure channels. In order for A/B to trust the output, it would need to e.g. be signed by their private keys, but then the computing environment has access to those keys and can do whatever it wants! The trick with FHE is to let B run a computation using their secret key "inside" the safe without letting anyone else see the key.
Pretty much! Expanding your explanation a little:
A computes msg_1 = Encrypt(A_source, A_key) and sends that to B.
B wants to run Validate(source) = Sign(Check_trustworthy(source), B_key) on A_source, but can't do that directly because B only has an encrypted version.
So B runs Validate under FHE on msg_1, producing msg_2 = Encrypt(Validate(A_source), A_key), and sends that to A.
A decrypts msg_2, producing msg_3 = Validate(A_source) = Sign(Check_trustworthy(A_source), B_key), and sends that back to B (if it meets the agreed-on format).
B has a claim that A's source is trustworthy, signed by B's key, which A can't have, so it must have been produced by B's program.
Step 2.1 is where the magic happens.
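If it helps, here's a toy sketch of the message flow. The "FHE" and "signing" functions are stand-ins I've made up purely to show who computes what and which key is involved at each step - there's no real cryptography here, and a real version would use an actual FHE library:

```python
# Stand-ins only: labels and asserts instead of actual encryption/signing.

def fhe_encrypt(plaintext, owner):
    # Real FHE would give B a ciphertext it cannot read.
    return {"ciphertext_for": owner, "hidden": plaintext}

def fhe_eval(function, ciphertext):
    # Real FHE evaluates `function` on the encrypted data without the
    # evaluator (B) ever seeing the plaintext.
    return {"ciphertext_for": ciphertext["ciphertext_for"],
            "hidden": function(ciphertext["hidden"])}

def fhe_decrypt(ciphertext, owner):
    # Only the key owner can decrypt; here we just check the label.
    assert ciphertext["ciphertext_for"] == owner
    return ciphertext["hidden"]

def sign(message, key):
    # Stand-in for a real signature scheme.
    return (message, "signed with " + key)

def check_trustworthy(source):
    # B's actual analysis of A's source would go here.
    return "do the right thing" in source

def validate(source):
    # B's Validator: check the source and sign the verdict with B's key.
    return sign(check_trustworthy(source), "B_key")

# 1. A encrypts its source under its own key and sends msg_1 to B.
msg_1 = fhe_encrypt("def act(): do the right thing", owner="A")

# 2. B runs Validate under FHE on msg_1, never seeing A's source.
msg_2 = fhe_eval(validate, msg_1)

# 3. A decrypts msg_2 and, if it matches the agreed format, returns it to B.
msg_3 = fhe_decrypt(msg_2, owner="A")

# B now holds a verdict signed with B's key, which A couldn't have forged.
print(msg_3)  # (True, 'signed with B_key')
```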
(I should have just put this in the post!)
Fantastic post, I think this is right on the money.
> Many more Newcomblike scenarios simply don't feel like decision problems: people present ideas to us in specific ways (depending upon their model of how we make choices) and most of us don't fret about how others would have presented us with different opportunities if we had acted in different ways.
I think this is a big deal. Part of the problem is that the decision point (if there was anything so firm) is often quite temporally distant from the point at which the payoff happens. The time when you "decide" to become unreliable (or the period in which you become unreliable) may be quite a while before you actually feel the ill effects of being unreliable.
> You cannot possibly gain new knowledge about physics by doing moral philosophy.
This seems untrue. If you have high credence in the two premisses:
- If X were a correct physical theory, then Y.
- Not Y.
then that should decrease your credence in X. It doesn't matter whether Y is a proposition about the behaviour of gases or about moral philosophy (although the implication is likely to be weaker in the latter case).
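To make the probabilistic version explicit (my own rendering, treating the first premiss as a material conditional):

```latex
P(X) = P(X \wedge Y) + P(X \wedge \neg Y)
     \leq P(Y) + P(\neg(X \rightarrow Y))
     = P(Y) + 1 - P(X \rightarrow Y)
```

So if your credence in the conditional is close to 1 and your credence in Y is close to 0, your credence in X is forced to be small.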
Constructivist logic works great if you interpret it as saying which statements can be proven, or computed, but I would say it doesn't hold up when interpreted as showing which statements are true (given your axioms). It's therefore not really appropriate for mathematics, unless you want to look at mathematics in the light of its computational or proof-theoretic properties.
Dialetheism requires paraconsistent logic, as you have to be able to reason in the presence of contradictions, but paraconsistent logic can be used to model things other than truth. For example, constructive logic is often given the semantics of showing what statements can be proven, rather than what statements are true. There are similar interpretations for paraconsistent logic.
OTOH, if you think that paraconsistent logic is the correct logic for truth, then you probably do have to be a dialetheist.
That's pretty weird, considering that so-called "sophisticated" consequentialist theories (where you can say something like: although in this instance it would be better for me to do X than Y, overall it would be better to have a disposition to do Y than X, so I shall have such a disposition) have been a huge area of discussion recently. And yes, it's bloody obvious and it's a scandal it took so long for these kinds of ideas to get into contemporary philosophy.
Perhaps the prof meant that such a consequentialist account appears to tell you to follow certain "deontological" requirements, but for the wrong reason in some way. In much the same way that the existence of a vengeful God might make acting morally also selfishly rational, but if you acted morally out of self-interest then you would be doing it for the wrong reasons, and wouldn't have actually got to the heart of things.
Alternatively, they're just useless. Philosophy has a pretty high rate of that, but don't throw out the baby with the bathwater! ;)
I agree that "right for the wrong reasons" is an indictment of your epsitemic process: it says that you made a prediction that turned out correctly, but that actually you just got lucky. What is important for making future predictions is being able to pick the option that is most likely, since "being lucky" is not a repeatable strategy.
The moral for making better decisions is that we should not praise people who predict prima facie unlikely outcomes -- without presenting a strong rationale for doing so -- but who then happen to be correct. Amongst those who have made unusual but successful predictions we have to distinguish people who are reliably capable of insight from those who were just lucky. Pick your contrarians carefully.
There's a more complex case where your predictions are made for the "wrong" reasons, but they are still reliably correct. Say you have a disorder that makes you feel nauseous in proportion to the unlikeliness of an option, and you habitually avoid options that make you nauseous. In that case, it seems more that you've hit upon a useful heuristic than anything else. Gettier cases aren't really like this, because they are usually more about luck than about reliable heuristics that aren't explicitly "rational".
Great post! I wish Harsanyi's papers were better known amongst philosophers.
Mainstream philosophy translation: moral concepts rigidly designate certain natural properties. However, precisely which properties these are was originally fixed by certain contingent facts about the world we live in and human history.
Hence the whole "If the world had been different, then what is denoted by "morality" would have been different, but those actions would still be immoral (given what "morality" actually denotes)" thing.
This position is sometimes referred to as "synthetic ethical naturalism".
I'm still worried about the word "model". You talk about models of second-order logic, but what is a model of second-order logic? Classically speaking, it's a set, and you do talk about ZF proving the existence of models of SOL. But if we need to use set theory to reason about the semantic properties of SOL, then are we not then working within a first-order set theory? And hence we're vulnerable to unexpected "models" of that set theory affecting the theorems we prove about SOL within it.
It seems like you're treating "model" as if it were a fundamental concept, when in fact the way it's used in mathematics is normally embedded within some set theory. But this then means you can't robustly talk about "models" all the way down: at some point your notion of model bottoms out. I don't think I have a solution to this, but it feels like it's a problem worth addressing.
It's like the opposite of considering the Least Convenient Possible World; the Most Convenient Possible World! Where everything on my side turns out as well as possible, and everything on yours turns out as badly as possible.
I'm pretty sure that the idea of the previous two paragraphs has been talked about before, but I can't find where.
It's pretty commonly discussed in the philosophical literature on utilitarianism.
I think most of this worrying is dissolved by better philosophy of mathematics.
Infinite sets can be proven to exist in ZF; that's just a consequence of the Axiom of Infinity. Drop the axiom, and you can't prove them to exist. You're perfectly welcome to work in ZF-Infinity if you like, but most mathematicians find ZF to be more interesting and more useful. I think the mistake is to think that one of these is the "true" axiomatization of set theory, and therefore that there is a fact of the matter over whether "infinite sets exist". There are just the facts about what is implied by what axioms.
If you're worried about how we think about implication in logic without assuming set theory, perhaps even set theory with Infinity, then I agree that that's worrying, but that's not particularly an issue with infinity.
Then, on the other hand, you might wonder whether some physical thing, like the universe, is infinite. That's now a philosophy of science question about whether using infinite sets or somesuch in our physical theories is a good idea. Still pretty different.
Aside: your specific arguments are invalid.
- The indistinguishability argument, regardless of whether it's good in principle, is incorrect. For infinite X, X and X' = X ∪ {x} are distinguishable in ZF. For one thing, X' is a strict superset of X, so if you want a set (a "property") that contains X but not X', try the powerset of X. I'm not really sure what else you mean by "indistinguishability".
- In the relative frequency argument, the limits are handled incorrectly: it can be the case that lim f(x) and lim g(x) are both undefined while lim f(x)/g(x) is perfectly well-defined; see the example below.
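A concrete instance (my choice of functions, not from the original argument):

```latex
f(x) = g(x) = \sin(x) + 2: \quad
\lim_{x \to \infty} f(x) \text{ and } \lim_{x \to \infty} g(x) \text{ do not exist,}
\quad \text{yet} \quad \frac{f(x)}{g(x)} = 1 \text{ for all } x,
\text{ so } \lim_{x \to \infty} \frac{f(x)}{g(x)} = 1
```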
This definitely seems to be a post-metaethics post: that is, it assumes something like the dominant EY-style metaethics around here (esp the bit about "intrinsic moral uncertainty"). That's fine, but it does mean that the discussion of moral uncertainty may not dovetail with the way other people talk about it.
For example, I think many people would gloss the problem of moral uncertainty as being unsure of which moral theory is true, perhaps suggesting that you can have a credence over moral theories much like you can over any other statement you are unsure about. The complication, then, is calculating expected outcomes when the value of an outcome may itself depend on which moral theory is true.
I'm not sure whether you'd class that kind of uncertainty as "epistemic" or "intrinsic".
You could also have metaethical uncertainty, which makes the whole thing even more complex.
Oh, I see. Sorry, I misinterpreted you as being sceptical about the normal usage of "purpose". And nope, I can't give a taboo'd account of it: indeed, I think it's quite right that it's a confused concept - it's just that it's a confused concept, not a confused use of a normal concept.
I'd claim that there is a distinct concept of "purpose" that people use that doesn't entail an agent with that purpose. It may be a pretty unhelpful concept, but it's one that people use. It may also have arisen as a result of people mixing up the more sound concept of purpose.
I think you're underestimating people who worry about "ultimate purpose". You say they "don't even understand the context", as opposed to people who "understand the full context of the concept". I'm not sure whether you're just being a linguistic prescriptivist here, but if there are a whole bunch of people using a word in a different way to the way it's normally used, then I'm inclined to think that the best way to understand that is that they mean something different by it, not that they're idiots who don't understand the word properly.
"What's the point of that curious tool in your shed?"
"Oh, it's for clearing weeds."
The purpose of the tool is to clear weeds. This is pretty underdetermined: if I used it to pick my teeth then there would be a sense in which the purpose of the tool was to act as a toothpick, and a sense in which I was using it for a purpose unintended by its creator, say.
Importantly, this isn't supposed to be a magically objective property of the object, no Aristotelian forms here! It's just a feature of how people use or intend to use the object.
+1 nitpickiness.
And Eliezer makes the same mistake in the linked article too ;) Not that it exactly matters!
If we're naming fallacies, then I would say that this post commits the following:
The Linguistic Consistency Fallacy: claiming, implicitly or otherwise, that a word must be used in the same way in all instances.
A word doesn't always mean the same thing even if it looks the same. People who worry about the purpose of life aren't going to be immediately reassured once you point out that they're just missing one of the relata. "Oh, silly me, of course, it's a three-place relation everywhere else, so of course I was just confused when I was using it here". If you ask people who are worrying about the purpose or meaning of life, "Purpose for whom?", in my experience they tend to say something like "Not for anyone in particular, just sort of "ultimate" purpose". Now, "ultimate purpose" may well be a vague concept, or one that we get somehow tricked into caring about, but it's not simply an example of people making a trivial mistake like leaving off one of the relata. People genuinely use the word "purpose" in different (but related) ways.
That said, the fact that everywhere else we use the word "purpose" it is three-place is certainly a useful observation. It might make us think that perhaps the three-place usage is the original, well-supported version, and the other one is a degenerate one that we are only using because we're confused. But the nature of that mistake is quite different.
If you think I'm splitting hairs here, think about whether this post feels like a satisfying resolution to the problem. Insofar as I still feel the pull of the concept of "ultimate purpose", this post feels like it's missing the point. It's not that "ultimate purpose" is just a misuse of the word "purpose", which, by the Linguistic Consistency Fallacy, must be used in the same way everywhere, it's that it's a different concept which is, for various reasons, a confused one.
FWIW I think "2-Place and 1-Place Words" is a bit dubious for similar reasons. Both this post and that make the crucial observation that we have this confusing concept that looks like it's a good concept "partially applied", but use this to diagnose the problem as incorrect usage of a concept, rather than viewing it as a perhaps historical account of how that confused concept came about.
Like I said, sort of splitting hairs, but it makes all the difference if you're trying to un-confuse people.
By semantics I mean your notion of what's true. All I'm saying is that if you think that you can prove everything that's true, you probably have an overly weak notion of truth. This isn't necessarily a problem that needs to be "fixed" by being really smart.
Also, I'm not saying that our notion of proof is too weak! Looking back I can see how you might have got the impression that I thought we ought to switch to a system that allows infinite proofs, but I don't. For one thing, it wouldn't be much use, and secondly I'm not even sure whether there is a proof system for SOL that is complete.
Absolutely, but it's one that happens in a different system. That can be relevant. And I quite agree: that still leaves some things that are unknowable even by supersmart AI. Is that surprising? Were you expecting an AI to be able to know everything (even in principle)?
They explicitly don't address that:
> Second, it might seem that this approach to determining Personal CEV will require a reasonable level of accuracy in simulation. If so, there might be concerns about the creation of, and responsibility to, potential moral agents.
Ooookay. The whole "loop" thing feels like a leaky abstraction to me. If you had to do that much work to explain the loopiness (which I'm still not sold on) and why it's a problem, perhaps saying it's "loopy" isn't adding much.
> This loses the sight of the original purpose: the evaluating criteria should be acceptable to the original person
I think I may still be misunderstanding you, but this seems wrong. The whole point is that even if you're on some kind of weird drugs that make you think that drinking bleach would be great, the idealised version of you would not be under such an influence, etc. Hence it might well be that the idealised advisors evaluate things in ways that you would find unacceptable. That's WAD.
Also, I find your other proposal hard to follow: surely if you've got a well-defined utility function already, then none of this is necessary?
It is!? Does anyone know a proof of Compactness that doesn't use completeness as a lemma?
Yes. Or, at least, I did once! That's the way we proved it in the logic course I did. The proof is a lot harder. But considering that the implication from Completeness is pretty trivial, that's not saying much.
Great post! It's really nice to see some engagement with modern philosophy :)
I do wonder slightly how useful this particular topic is, though. CEV and Ideal Advisor theories are about quite different things. Furthermore, since Ideal Advisor theories are working very much with ideals, the "advisors" they consider are usually supposed to be very much like actual humans. CEV, on the other hand, is precisely supposed to be an effective approximation, and so it would seem surprising if it were to actually proceed by modelling a large number of instances of a person and then enhancing them cognitively. So if instead it proceeds by some more approximate (or alternatively, less brute-force) method, then it's not clear that we should be able to apply our usual reasoning about human beings to the "values advisor" that you'd get out of the end of CEV. That seems to undermine Sobel's arguments as applied to CEV.
This comment reads to me like: "Haha, I think there are problems with your argument, but I'm not going to tell you what they are, I'm just going to hint obliquely in a way that makes me look clever."
If you actually do have issues with Sobel's arguments, do you think you could actually say what they are?
A lot of what you've said sounds like you're just reiterating what Luke says quite clearly near the beginning: Ideal Advisor theories are "metaphysical", and CEV is epistemic, i.e. Ideal Advisor theories are usually trying to give an account of what is good, whereas, as you say, CEV is just about trying to find a good effective approximation to the good. In that sense, this article is comparing apples to oranges. But the point is that some criticisms may carry over.
[EDIT: this comment is pretty off the mark, given that I appear to be unable to read the first sentence of comments I'm replying to. "historical context" facepalm]
I absolutely agree that this will help people stop being confused about Godel's theorem, I just don't know why EY does it in this particular post.
Do you have any basis for this claim?
Nope, it's pure polemic ;) Intuitively I feel like it's a realism/instrumentalism issue: claiming that the only things which are true are provable feels like collapsing the true and the knowable. In this case the decision is about which tool to use, but using a tool like first-order logic that has these weird properties seems suspicious.
Oh yeah - brain fail ;)
> the compactness theorem is equivalent to the ultrafilter lemma, which in turn is essentially equivalent to the statement that Arrow's impossibility theorem is false if the number of voters is allowed to be infinite.
Well, I can confirm that I think that that's super cool!
> the compactness theorem is independent of ZF
As wuncidunci says, that's only true if you allow uncountable languages. I can't think of many cases off the top of my head where you would really want that... countable is usually enough.
Also: more evidence that the higher model theory of first-order logic is highly dependent on set theory!
I think it's worth addressing that kind of argument because it is fairly well known. Penrose, for example, makes a huge deal over it. Although mostly I think of Penrose as a case study in how being a great mathematician doesn't make you a great philosopher, he's still fairly visible.
Exactly (I'm assuming by subset you mean non-strict subset). Crucially, a non-standard model may not have all the bijections you'd expect it to, which is where EY comes at it from.
Sure. So you're not going to be able to prove (and hence know) some true statements. You might be able to do some meta-reasoning about your logic to figure some of these out, although quite how that's supposed to work without requiring the context of set theory again, I'm not really sure.
This post doesn't really say anything interesting about retributive justice at all. It sounds like what's actually bugging you is the question of national "sovereignty". Plus, you Godwined yourself. Between these things, you give a pretty bad impression. Perhaps if you reposted it with a less flamebaity example and a title like "Is there any ethical reason to respect national sovereignty?" or something you might fare better.
A few things.
a) I'm a little confused by the discussion of Cantor's argument. As I understand it, the argument is valid in first-order logic; it's just that the conclusion may have different semantics in different models. That is, the statement "the set X is uncountable" is cashed out in terms of set theory, and so if you have a non-standard model of set theory, then that statement may have non-standard semantics.
This is all made horrendously confusing by the fact that when we do model theory we tend to model our domains using sets. So even in a non-standard model of set theory we'll usually be talking about sets doing the modelling, and so we may actually be using a set that is countable in the "outer" set theory in which we're working, but not in the "inner" theory which we're modelling.
What the requirement to use set theory to talk about first-order logic says about the status of logic is a whole other hopelessly circular kettle of fish.
Anyway, I think that's basically what you were saying, but I actually found your explanation harder to follow than the usual one. Which is unusual, since I normally think your explanations of mathsy stuff are very good!
b) I kind of feel like Godel's theorem could be dropped from this post. While it's nice to reiterate the general point that "If you're using Godel's theorem in an argument and you're not a professional logician, you should probably stop", I don't think it actually helps the thrust of this post much. I'd just use Compactness.
c) The Compactness theorem is the best reason not to use first-order logic. Seriously, it's weird as hell. We're all so used to it from doing model theory etc, but it's pretty counter-intuitive full stop; doesn't correspond to how we normally use logic; and leads to most of the "remarkable" properties of first-order logic.
Your semantics is impoverished if you can prove everything with finite syntactical proofs. Down with Compactness!
Right. So, as I said, you are counselling that "anthropics" is practically not a problem, as even if there is a sense of "expect" in which it would be correct to expect the Boltzmann-brain scenario, this is not worth worrying about because it will not affect our decisions.
That's a perfectly reasonable thing to say, but it's not actually addressing the question of getting anthropics right, and it's misleading to present it as such. You're just saying that we shouldn't care about this particular bit of anthropics. Doesn't mean that I wouldn't be correct (or not) to expect my impending dissolution.