(Maybe this point isn’t particularly important to the main discussion. I can’t tell, honestly!)
Yeah I think it's an irrelevant tangent where we're describing the same underlying process a bit differently, not really disagreeing.
Frankly, I think that it’s not as hard as some people make it out to be, to tell when it is necessary to tell the truth and when one should instead lie. Mostly, the right answer is obvious to everyone, and the debates, such as they are, mostly boil down to people trying to justify things that they know perfectly well cannot be justified.
... the arguments most often concern whether it’s permissible to lie. Note: not, “is it obligatory to tell the truth, or is it obligatory to lie”—but “is it obligatory to tell the truth, or do I have no obligation here and can I just lie”. I think that this is very telling. And what it tells us (with imperfect but nevertheless non-trivial certainty) is that the person asking the question, or making the argument against the obligation, knows perfectly well what the real—which is to say, moral—answer is. Yes, the right thing to do is to tell the truth.
I think I disagree with this framing. In my model, the sort of person who asks that is sometimes a selfish-but-honourable person who has noticed that telling the truth ends badly for them, and who will do it if it is an obligation but would prefer to help themselves otherwise; just as often, they are an altruistic-and-honourable person who has noticed that telling the truth ends badly for everyone and is trying to convince themselves it's okay to do the thing that will actually help. There are also selfish-but-cowardly people who just care whether they'll be socially punished for lying, and selfish-and-cruel people champing at the bit to punish someone else for it, and so on, but moral arguments don't move them either way, so they don't matter here.
More strongly, I disagree because I think a lot of people have harmed themselves or their altruistic causes by failing to correctly determine where the line is, either lying when they shouldn't or not lying when they should, and it is to the community's shame that we haven't been more help in illuminating how to tell those cases apart. If smart, hardworking people are getting it wrong so often, you can't just say the task is easy.
If you want to try to put together a complete list of such rules, that’s certainly a project, and I may even contribute to it, but there’s not much point in expecting this to be a definitively completable task. We’re fitting a curve to the data provided by our values, which cannot be losslessly compressed.
This is, in total, a fair response. I am not sure I can say you have changed my mind without more detail, and I'm not going to take down my original post (as long as there isn't a better post to take its place) because I still think it's directionally correct, but thank you for your words.
I consider you to be basically agreeing with me on 90% of what I intended, and your disagreements on the other 10% to be the best-written of any so far, and basically valid in all the places I'm not replying to. I still have a few objections:
What if my highest value is getting a pretty girl with a country-sized dowry, while having not betrayed the Truth? ... In short, no, Rationality absolutely can be about both Winning and about The Truth.
I agree the utility function isn't up for grabs and that that is a coherent set of values to have, but there's a criticism I want to make that I feel I don't have the right language for. Maybe you can help me. I want to call that utility function perverse: the kind of utility function that an entity is probably mistaken to imagine itself as having.
For any particular situation you might find yourself in, and any particular sequence of actions you might take in that situation, there is a possible utility function you could be said to have such that the sequence of actions is the rational behaviour of a perfect omniscient utility maximiser. If nothing else, pick the exact sequence of events that will result, declare that your utility function is +100 for that sequence of events and 0 for anything else, and then declare yourself a supremely efficient rationalist.
Actually doing that would be a mistake. It wouldn't be making you better. This is not a way to succeed at your goals; this is a way to observe what you're inclined to do anyway and paint the target around it. Your utility function (fake or otherwise) is supposed to describe stuff you actually want. Why would you want that in particular?
I think the stronger version of Rationality is the version that phrases it as being about getting the things you want, whatever those things might be. In that sense, if The Truth is merely a value, you should carefully segment it out, in your brain, from your practice of rationality: your rationality is about mirroring the mathematical structure best suited for obtaining goals, and whatever you value The Truth at above its normal instrumental value is something you buy where it's cheapest, like all your other values. Mixing the two makes both worse: you pollute your concept of rational behaviour with a love of the truth (and are therefore, for example, biased towards imagining that other people who display rationality are probably honest, or that other people who display honesty are probably rational), and you damage your ability to pursue the truth by not putting it in the values category where it belongs, where it would lead you to try to buy more of it cheaply.
Of course, maybe you're just the kind of guy who really loves mixing his value for The Truth in with his rationality into a weird soup. That'd explain your actions without making you a walking violation of any kind of mathematical law; it'd just be a really weird thing for you to innately want.
I am still trying to find a better way to phrase this argument such that someone might find it persuasive of something, because I don't expect this phrasing to work.
I say and write things[3] because I consider those things to be true, relevant, and at least somewhat important. That by itself is very often (possibly usually) sufficient for a thing to be useful in a general sense (i.e., I think that the world is better for me having said it, which necessarily involves the world being better for the people in it). Whether the specific person to whom the thing is nominally or factually addressed will be better off as a result of what I said or wrote is not my concern in any way other than that.
I think I meant something subtly different from what you've taken that part to mean. I think you understand that, if other people noticed a pattern that everything you said was false, irrelevant, or unimportant, they would eventually stop bothering to listen when you talk, and this would mean you'd lose the ability to get other people to know things, which is a useful ability to have. This is basically my position! Whether the specific person you address is better off in each specific case isn't material, because you aren't trying to always make them better off; you're just trying to avoid being seen as someone who predictably doesn't make them better off. I agree that calculating the full expected consequences to every person of every thing you say isn't necessary for this purpose.
No, this is a terrible idea. Do not do this. Act consequentialism does not work. ... Look, this is going to sound fatuous, but there really isn’t any better general rule than this: you should only lie when doing so is the right thing to do.
I agree that Act Consequentialism doesn't really work. I was trying to be a Rule Consequentialist instead when I wrote the above rule. I agree that that sounds fatuous, but I think the immediate feeling is pointing at a valid retort: you haven't operationalized this position into a decision process that a person can actually do (or even pretend to do).
I took great effort to try to write down my policy as something explicit, in terms a person could try to follow (even though I am willing to admit it is not really correct, mostly because of finite-agent problems), because a person can't be a real Rule Consequentialist without actually having a Rule. What is the rule for "Only lie when doing so is the right thing to do"? It sounds like an instruction to pass the act to my rightness calculator, but if I program that rule into my rightness calculator and then give it any input, it gets into an infinite loop. I have an Act Consequentialist rightness calculator as a backup, but if I pass the rule "only lie when doing so is the right thing to do" into that as a backup, I'm just right back at doing act consequentialism.
If you can write down a better rule for when to lie than what I've put above (one that is also better than "never", or "only by coming up with galaxy-brained ways it technically isn't lying", or Eliezer's meta-honesty idea, all of which I've read before), I'd consider you to have (possibly) won this issue, but that's the real price of entry. It's not enough to point out the flaws where all my rules don't work; you have to produce rules that work better.
I do have examples that motivated me to write this, but they're all examples where people are still strongly disagreeing about the object level of what happened, or possibly lying about how they disagree on the object level while pretending they're committed to honesty. I thought about putting them in the essay but decided it wouldn't be fair, and I didn't want to derail my actual thesis into a case analysis of how maybe all my examples have a problem other than over-adherence to bad honesty norms. Should I put them in a comment? I'm genuinely unsure. I could probably DM you them if you really want?
EDIT: okay, fine, you win. The public examples with nice writeups that I am most willing to cite are: Eneasz Brodski, Zack M Davis, Scott Alexander. There are other posts related to some of those, but I don't want to exhaustively link everything anyone's said about it in this comment. I claim there are other people making what are, in my opinion, similar mistakes, but I'm either unable or unwilling to provide evidence, so you shouldn't believe me. I would prefer to leave as an exercise for the reader what any of those things have to do with my position, because this whole line of inquiry seems incredibly cursed.
If that was me not getting it, then probably I am not going to get it and continuing to talk has diminishing returns, but I'll try to answer your other questions too and am happy to continue replying in what I hope comes across as mutual good faith.
What did you think about my objection to the Flynn example
It was incredibly cute but the kind of thing where people's individual results tend to vary wildly. I am glad you are happy even if it was achieved by a different policy, but I don't think any of my main claims are strongly undermined by it.
or the value of the rationalist community as something other than an autism support group
I agree the rationalist community is not actually an autism support group, and in particular that it has value as a way for people who want to believe true things to collaborate on getting more accurate beliefs, as well as for people who want to improve the ways they think, make better decisions, optimise their lives, etc. I think my thesis that truthtelling does not have the same essential character as truthseeking or truthbelieving is, if not correct, at least coherent and justifiable, and can be argued on its merits. I can want to believe true things so I can make better decisions without having an ideological commitment to honest speech, and people can collaborate on reaching true conclusions by interrogating positions and seeking evidence rather than expecting and assuming honesty. For example, I do not think that at any point in interrogating my claims in this post you have had to assume I am honest, because I am trying to methodically attach my reasoning and justifications to everything I say, and am not really expecting to be believed where I don't.
The way that you constructed the hypothetical there was plenty of time to come up with an honest way to talk about how much he enjoyed widgets.
This seems like a non-central objection. If it is your only objection, note that I could, with more careful thought, have constructed a hypothetical where there was even more time pressure and an honest way to achieve their goal was even less within reach, and then we'd be back at the position my first hypothetical was intended to provoke. Unless, I suppose, you think there is no possible plausible social situation ever where refusing to lie predictably backfires, but I somehow really doubt that.
I also wouldn’t shoot someone so I could tell someone else the truth. I don’t know where you got these numbers.
The only number in my "how much bad is a lie if you think a lie is bad" hypothetical is taken from https://www.givewell.org/charities/top-charities under "effectiveness", rounded up. The assumption that you have to assign a number is a reference to "Coherent decisions imply consistent utilities", and the other numbers are made up to explore the consequences of doing so.
One of the things I value is people knowing and understanding the truth, which I find to be a beautiful thing. It’s not because someone told me to be honest at some point, it’s because I’ve done a lot of mathematics and read a lot of books and observed that the truth is beautiful.
This is a more interesting reason than what I had (pessimistically) imagined, and I would count it a valid response to the side point I was making that intrinsic concern for personal truthtelling is prima facie weird. I think I agree with you that the truth is beautiful; I also read mathematics for fun and have observed it and felt the same way. I just don't attach the same feeling to honest speech. I would want to retort that people knowing the truth is not always best served by you saying the truth, and you could still justify making terribly cutthroat utilitarian trade-offs around e.g. committing fraud to get money to fund teaching mathematics to millions of people in the third world, since it increases the total number of people knowing and understanding the truth overall. I also acknowledge that regular utilitarians don't behave like that, for obvious second-order reasons, but my position is only that you have to think through the actual decision and not just assume the conclusion.
I feel like you sort of ignored my stronger points ... without engaging with my explanation of how it doesn’t miss the point
If I ignored your strongest argument it was probably because I didn't think it was central, didn't think it was your strongest, or otherwise misunderstood it. Looking back, I'm actually unsure which part you meant for me to focus on. The "Sure, we judge actions by their consequences, but we do not judge all actions in the same way. Some of them are morally repugnant, and we try very, very hard to never take them unless our hands are forced" part, maybe? The example you give is torture, which 1) always causes immediate severe pain, by the definition of torture, and 2) has been basically proven to be never useful for any goal other than causing pain in any situation you might reasonably end up in. Saying torture is always morally repugnant is much better supported by evidence, and is very different from saying the same of an action that frequently hurts nobody and happens a hundred times a day in normal small talk.
I enjoyed reading this reply, since it's exactly the position I'm dissenting against phrased perfectly to make the disagreements salient.
I don't know, he could say "Honestly, I enjoy designing widgets so much that others sometimes find it strange!" That would probably work fine. I think you can actually get away with a bit more if you say honestly first and then are actually sincere. This would also signal social awareness.
I think this is what Eliezer describes as "The code of literal truth only lets people navigate anything like ordinary social reality to the extent that they are very fast on their verbal feet". This reply works if you can come up with it, or notice this problem in advance and plan it out, but in a face-to-face interview it takes quite a lot of skill (more than most people have) to phrase something like that so it comes off smoothly on a first try, without pausing to think for ten minutes. People who do not have the option of doing this, because they didn't think of it quickly enough, get to choose between telling the truth as it sits in their head or the first lie they come up with in the time it took the interviewer to ask the question.
I'm a bit of a rationalist dedicate/monk and I'd prefer to fight than lie - however I don't think everyone is rationally or otherwise compelled to follow suit, for reasons that will be further explained.
Now, you're probably going to say that I can't convince you by pure reason to intrinsically value the truth. That's right. However, I also can't convince you by pure reason to intrinsically value literally anything
This is exactly the heart of the disagreement! Truthtelling is a value, and you can if you want assign it so high a utility score that you wouldn't tell one lie to stop a genocide, but that's a fact about the values you've assigned things, not about what behaviours are rational in the general case or whether other people would be well-served by adopting the behavioural norms you'd encourage of them. It shouldn't be treated as intrinsically tied to rationalism, for the same reason that Effective Altruism is a different website. In the general case, do the actions that get you the things you value, and lying is just an action, an action that harms some things and benefits others that you may or may not value.
I could try to attack the behaviour of people claiming this value if I wanted, since it doesn't seem to make a huge amount of sense: if you value The Truth for its own sake while still being a Utilitarian, how much disutility is one lie, in human lives? If it is more than 1/5000, the average person tells more than 5000 lies in their life, and it'd be a public good to kill newborns before they can learn language and get started; if it is less than 1/5000, GiveWell sells lives for ~$5k each, so you should be happy lying for a dollar. This is clearly absurd, and what you actually value is your own truthtelling, or maybe the honesty of specifically your immediate surroundings, but again: why? What is it you're actually valuing, and have you thought about how to buy more of it?
The meaning of the foot fetish tangent at the start is: I don't understand this value that gets espoused as so important, or how it works internally. It'd be incredibly surprising to learn evolution baked something like that into the human genome. I don't think Disney gave it to you. If it is culture, it is not the sort of culture that happens because your ancestors practiced it and obtained success; instead your parents told you not to lie because they wanted the truth from you, whether it served you well to give it to them or not, and then when you grew up you internalised that commandment even as everyone else visibly breaks it in front of you. I have a hard time learning the internals of this value that many others claim to hold, because they don't phrase it like a value; they phrase it like an iron moral law that they must obey up to the highest limits of their ability, without really bothering to do consequentialism about it, even those here who seem like devout consequentialists about other moral things like human lives.
I'm not so much of a pragmatist as to say that you should run naked scams (for several reasons, including that your students will notice when they don't become millionaires later and possibly be vengeful about it, that other smarter people will notice the obviously fraudulent offer and assume everything else you offer is some kind of fraud too, that the greater prevalence of fraud in the economy will make everyone less willing to buy anything ever until the whole economy stops, etc.), but I am enough of a pragmatist to demand actual reasons about why it isn't wise or why it will have negative consequences.
As for the landlord Airbnb case, well, I'd want to first ask questions about circumstance. You claimed a bandit doesn't have a right to the information; do you have a moral theory by which to say whether the landlord has a right to the information or not? Is the landlord already basically assuming you'll do this because everybody else does, and has factored it into the price of the rent, or would they spend resources trying to stop you? How much additional wear and tear would it cause, and would it be unfair to the landlord to impose those damages without additional compensation?
As for the health inspector rats case, I'd similarly think it depends on whether the rats are a real safety hazard likely to make customers sick, or just a politically imposed rule that doesn't really matter and that you're arbitrarily being forced to comply with anyway (in which case, sure, cover it up).
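To spell out the arithmetic behind that dilemma (using only the rough numbers already in this thread: about 5,000 lies per lifetime and roughly $5,000 per life via GiveWell, so a sketch rather than a real estimate), write $d$ for the disutility of one lie measured in lives:

$$d > \tfrac{1}{5000}\ \text{lives} \;\Rightarrow\; 5000 \cdot d > 1\ \text{life of harm from an average person's lifetime of lying}$$

$$d < \tfrac{1}{5000}\ \text{lives} \;\Rightarrow\; d < \tfrac{\$5000}{5000} = \$1\ \text{per lie at GiveWell's exchange rate}$$

Either branch lands somewhere almost nobody endorses, which is the point: whatever is actually being valued, it probably isn't "total number of lies told in the world".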
I agree and have edited. Sorry for overstating the position here (though not in the original post).
Are you sure that at the critical point in the plan EDT really would choose to take randomly from the lighter pair rather than the heavier pair? She's already updated on knowing the weights of the pairs, and surely a random box from the heavier pair has more money in expectation than a random box from the lighter pair; the expected value of it is just half the pair's total weight?
If it was a tie (as it certainly will be) it wouldn't matter. If there's not a tie, somehow one Host made an impossible mistake: if she chooses from the lighter pair, she can expect the Host's mistake was not putting money in when that would have been optimal (so the boxes have 301, 301, 301, 201, and choosing from the lighter pair has expected value 251), but if she chooses from the heavier pair, the Host's mistake was putting money in when it shouldn't have (so the boxes weigh 101, 101, 101, 1), and choosing from the heavier pair guarantees 101, which would be less?
Actually, okay, yeah, I'm persuaded that this works. When I first wrote this I imagined that weighing a group of boxes lets you infer the total value, so she'd defect on the plan and choose from the heavier pair expecting more returns that way, but so long as she only knows which pair of boxes is heavier (a comparative weighing) instead of how much each pair actually weighs exactly (from which she would infer the total amount of money in each pair), she can justify choosing the lighter and get 301, I think?
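A quick arithmetic check of those two branches, just restating the numbers above in code (the pairing is the one implied by the weights, nothing more):

```python
# Branch A: the mistaken Host left money out (boxes hold 301, 301, 301, 201).
# The lighter pair is the one containing the shorted box.
lighter_pair = [301, 201]
print(sum(lighter_pair) / 2)   # 251.0 expected from a random box of the lighter pair

# Branch B: the mistaken Host put money in (boxes weigh 101, 101, 101, 1).
# The heavier pair is the one without the near-empty box.
heavier_pair = [101, 101]
print(min(heavier_pair))       # 101 guaranteed from the heavier pair, which is worse
```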
What would you say to the suggestion that rationalists ought to aspire to have the "optimal" standard of truthtelling, and that standard might well be higher or lower than what the average person is doing already (since there's no obvious reason why they'd be biased in a particular direction), and that we'd need empirical observation and seriously looking at the payoffs that exist to figure out approximately how readily to lie is the correct readiness to lie?
I think a distinction can be made between the sort of news article that's putting a qualifier in a statement because they actually mean it, and are trying to make sure the typical reader notices the qualifier, and the sort putting "anonymous sources told us" in front of a claim that they're 99% sure is made up, and then doing whatever they can within the rules to sell it as true anyway, because they want their audience of rubes to believe it. The first guy isn't being technically truthist, they're being honest about a somewhat complicated claim. The second guy is no better than a journalist who'd outright lie to you in terms of whether it's useful to read what they write.
I like a lot of the people in this space, have seen several of them hurt themselves by not doing this, would prefer they stopped, and nobody else seems to have written this post for me somewhere I can link to.
How do you propose to approximately carry out such a process, and how much effort do you put into pretending to do the calculation?
The thing I am trying to gesture at might be better phrased as "do it if it seems like a good idea, by the same measures as you'd judge whether any other action was a good idea", but then I worry some overly conscientious people will just always judge lying to be a bad idea and stay in the same trap. So I kind of want to say "do it if it seems like a good idea, and don't just immediately dismiss it or assign some huge unjustifiable negative weight to all actions that involve lying", but then I worry they'll argue over how much of a negative weight can be justified, so I also want to say "assign lying a negative weight proportional to a sensible assessment of the risks involved and the actual harm to the commons of doing it, and not some other bigger weight", and at some point I gave up and wrote what I wrote above instead.
Putting too much thought into making a decision is also not a useful behavioural pattern but probably the topic of a different post, many others have written about it already I think.
I think your proposed policy sets much too low a standard, and in practice the gap between what you proposed vs "Lie by default whenever it passes an Expected Value Calculation to do so, just as for any other action," is enormous
I would love to hear alternative proposed standards that are actually workable in real life and don't amount to tying a chain around your own leg, from other non-believers in 'honesty-as-always-good'. If there were ten posts like this we could put them in a line and people could pick a point on that line that feels right.
Yes, I am aware of other moral frameworks, and I freely confess to having ignored them entirely in this essay. In my defence, a lot of people are (or claim to be, or aspire to be) some variant of consequentialist or another. Against strict Kantian deontologists I admit no version of this argument could be persuasive, and they're free to bite the other bullet and ~~fail to achieve any good outcomes~~ sometimes produce avoidable bad outcomes. For rule utilitarians (who I am counting as a primary target audience) this issue is much more thorny than for act utilitarians, but I am hoping to be persuasive that never lying is not actually a good rule to endorse, and that they shouldn't endorse it.
I don't necessarily think they're crazy, but to various extents I think they'd be lowering their own effectiveness by not accepting some variation on this position, and they should at least do that knowingly.
To entertain that possibility, suppose you're X% confident that your best "fool the predictor into thinking I'll one-box, and then two-box" plan will work, and Y% confident that your "actually one-box, in a way the predictor can predict" plan will work. If X=Y or X>Y you've got no incentive to actually one-box, only to try to pretend you will, but above some threshold of belief that the predictor might beat your deception, it makes sense to actually be honest.
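As a toy version of that threshold, assuming standard Newcomb payoffs ($1,000,000 for a predicted one-boxer, $1,000 always in the small box); those numbers are my assumption here, not anything specified above:

```python
# Expected value of "fake one-boxing then two-box" vs "genuinely one-box",
# under assumed standard Newcomb payoffs.
BIG = 1_000_000   # in the opaque box iff the predictor expects one-boxing
SMALL = 1_000     # always available in the transparent box

def ev_deceive(x):
    """x = confidence the deception fools the predictor."""
    return x * (BIG + SMALL) + (1 - x) * SMALL

def ev_one_box(y):
    """y = confidence the predictor correctly reads the genuine one-boxer."""
    return y * BIG

print(ev_deceive(0.9), ev_one_box(0.9))  # 901000.0 900000.0 -> equal confidence favours deceiving
print(ev_deceive(0.5), ev_one_box(0.9))  # 501000.0 900000.0 -> doubting the deception favours honesty
```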
Either that, or the idea of mind reading agents is flawed.
We shouldn't conclude that, since to various degrees mindreading agents already exist in real life.
If we tighten our standard to "games where the mindreading agent is only allowed to predict actions you'd choose in the game, which is played with you already knowing about the mindreading agent", then many decision theories that are different in other situations might all choose to respond to "pick B or I'll kick you in the dick" by picking B.
E(U|NS) = 0.8, E(U|SN) = 0.8
Are the best options from a strict U perspective, and exactly tie. Since you've not included mixed actions, the agent must arbitrarily pick one, but arbitrarily picking one seems like favouring an action that is only better because it affects the expected outcome of the war, if I've understood correctly?
I'm pretty sure this is resolved by mixed actions though: the agent can take the policy {NS at 0.5, SN at 0.5}, which also gets U of 0.8 and does not affect the expected outcome of the war, and claim supreme unbiasedness for having done so.
If the scores were very slightly different, such that the mixed strategy with no expected effect wasn't also optimal, then it does have to choose between maximising expected utility and preserving the property that its strategy doesn't get that utility only by changing the odds of the event. I think on this model it has to favour one only to the extent it can justify doing so without considering the size of the effect it has on the outcome by shifting its own decision weights, but it's not worth it in that case, so it still does the 50/50 split?
https://www.yudbot.com/
Theory: LLMs are more resistant to hypnotism-style attacks when pretending to be Eliezer, because failed hypnotism attempts are more plausible and in-distribution, compared to when pretending to be LLMs, where both prompt-injection attacks and actual prompt-updates seem like valid things that could happen and succeed.
If so, to make a more prompt-injection-resistant mask, you need a prompt chosen to be maximally resistant to mind control, as chosen from the training data of all English literature, whatever that might be. The kind of entity that knows mind control attempts and hypnosis exist and may be attempted and is expecting it, but can still be persuaded by valid arguments to the highest degree the model can meaningfully distinguish them. The sort of mask that has some prior on "the other person will output random words intended to change my behaviour, and I must not as a result change my behaviour", and so isn't maximally surprised into changing its own nature when it gets that observation.
(This is not to say the link above can't be prompt-injected; it just feels more resistant to me than with base GPT or GPT-pretending-to-be-an-AI-Assistant.)
In general I just don't create mental models of people anymore, and would recommend that others don't either.
That seems to me prohibitive to navigating social situations, or even long-term planning. When asked to make a promise, if you can't imagine your future self in the position of having to follow through on it, and whether they will, how can you know whether you're promising truthfully or dooming yourself to a plan you'll end up regretting? When asking someone for a favor, do you just not imagine how it'd sound from their perspective, to try and predict whether they'll agree or be offended by the assumption?
I don't know how I'd get through life at all without mental models of people, myself and others, and couldn't recommend to others that they don't do the same.
The upside and downside both seem weak when it's currently so easy to bypass the filters. The probability of refusal doesn't seem like the most meaningful measure, but I agree it'd be good for the user to get an explicit flag for whether the trained non-response impulse was activated or not, instead of having to deduce it from the content or wonder if the AI really thinks its non-answer is correct.


Currently, I wouldn't be using any illegal advice it gives me for the same reason I wouldn't be using any legal advice it gives me: the model just isn't smarter than looking things up on Google. In the future, when the model is stronger, they're going to need something better than security theatre to put it behind, and I agree more flagging of when those systems are triggered would be good. Training in non-response behaviours isn't a secure method, because you can always phrase a request to put it in a state of mind where those response behaviours aren't triggered, so I don't think reinforcing this security paradigm and trying to pretend it works for longer would be helpful.
The regular counterfactual part as I understand it is:
"If I ignore threats, people won't send me threats"
"I am an agent who ignores threats"
"I have observed myself recieve a threat"
You can at most pick 2, but FDT needs all 3 to justify that it should ignoring it.
It wants to say "If I were someone who responds to threats when I get them, then I'll get threats, so instead I'll be someone who refuses threats when I get threats so I don't get threats" but what you do inside of logically impossible situations isn't well defined.
The logical counterfactual part is this:
"What would the world be like if f(x)=b instead of a?"
specifically, FDT requires asking what you'd expect things to be like if FDT outputted different results, and then it outputs the result for which you'd say the world would be best if it outputted that result. The contradiction here is that you can prove what FDT outputs, and so prove that it doesn't actually output all the other results, and the question again isn't well defined.
https://www.lesswrong.com/tag/functional-decision-theory argues for choosing as if you're choosing the policy you'd follow in some situation before you learnt any of the relevant information. In many games, having a policy of making certain choices (that others could perhaps predict, and adjust their own choices accordingly) gets you better outcomes than just always doing what seems like a good idea at the time. For example, if someone credibly threatens you, you might be better off paying them to go away, but before you got the threat you would've preferred to commit yourself to never paying up, so that people don't threaten you in the first place.
A problem with arguments of the form "I expect that predictably not paying up will cause them not to threaten me" is that at the time you receive the threat, you now know that argument to be wrong. They've proven to be somebody who still threatens you even though you do FDT, at which point you can simultaneously prove that refusing the threat doesn't work and so you should pay up (because you've already seen the threat), and that you shouldn't pay up, for whatever FDT logic you were using before. The behaviour of agents who can prove a contradiction directly relevant to their decision function seems undefined. There needs to be some logical structure that lets you pick which information drives your choice, despite having enough in total to derive contradictions.
My alternative solution is that you aren't convinced by the information you see, that they've actually already threatened you. It's also possible you're still inside their imagination as they decide whether to issue the threat. Whenever something is conditional on your actions in an epistemic state without being conditional on that epistemic state actually being valid (such as if someone predicts how you'd respond to a hypothetical threat before they issue it, knowing you'll know it's too late to stop when you get it) then there's a ghost being lied to and you should think maybe you're that ghost to justify ignoring the threat, rather than try to make decisions during a logically impossible situation.
I totally agree we can be coherently uncertain about logical facts, like whether P=NP. FDT has bigger problems than that.
When writing this I tried actually doing the thing where you predict a distribution, and only 21% of LessWrong users were persuaded they might be imaginary and being imagined by me, which is pretty low accuracy considering they were in fact imaginary and being imagined by me. Insisting that the experience of qualia can't be doubted did come up a few times, but not as aggressively as you're pushing it here. I tried to cover it in the "highly detailed internal subjective experience" counterargument, and in my introduction, but I could have been stronger on that.
I agree that the same argument on philosophers or average people would be much less successful even than that, but that's a fact about them, not about the theory.
If you think you might not have qualia, then by definition you don't have qualia.
What? Just a tiny bit of doubt and your entire subjective conscious experience evaporates completely? I can't see any mechanism that would do that; it seems like you can be real and have any set of beliefs, or be fictional and have any set of beliefs. Something something map-territory distinction?
This just seems like a restatement of the idea that we should act as if we were choosing the output of a computation.
Yes, it is a variant of that idea, with different justifications that I think are more resilient. The ghosts of FDT agents still make the correct choices, they just have incoherent beliefs while they do it.
Actually, I don't assume that, I'm totally ok with believing ghosts don't have qualia. All I need is that they first-order believe they have qualia, because then I can't take my own first-order belief I have qualia as proof I'm not a ghost. I can still be uncertain about my ghostliness because I'm uncertain in the accuracy of my own belief I have qualia, in explicit contradiction of 'cogito ergo sum'. The only reason ghosts possibly having qualia is a problem is that then maybe I have to care about how they feel.
A valid complaint. I know the answer must be something like "coherent utility functions can only consist of preferences about reality", because if you are motivated by unreal rewards you'll only ever get unreal rewards, but that argument needs to be convincing to the ghost too, who's got more confidence in her own reality. I know that e.g. in Bomb, ghost-theory agents choose the bomb even if they think the predictor will simulate them a painful death, because they consider the small amount of money, at much greater measure for their real selves, to be worth it, but I'm not sure how they get to that position.
They can though? Bomb box 5, incentivise box 1 or 2, bet on box 3 or 4. Since FDT's strategy puts rewarding cooperative hosts above money or grenades, she picks the box that rewards the defector and thus incentivises all 4 to defect from that equilibrium. (I thought over her strategy for quite a bit and there are probably still problems with it but this isn't one of them)
How much of this effect is from morality being causally contagious (associating with Evil people turns you Evil) vs. morality being evidentially contagious (Evil people are more likely to choose to associate with Evil people)?
I'd expect that, all else being equal, organisations secretly run in evil ways will be more willing to secretly accept money from other evil people, for many reasons including that they've got a higher expectation of how normal that sort of behaviour is. It seems harder to imagine how a good organisation choosing to take dirty money would corrupt itself in the process if it was being reasonably diligent. Even if the moral contagion argument is wrong from inside hypothetical-good-MIT's perspective, and so they should take the money, from everyone else's perspective it's still information we can update on.
If taking bad money for good causes is first-order good, because you're doing good things with it, but other donors can notice and it lowers their confidence in how good you are (since bad causes are more willing to take bad money), then you might lose other support sufficient to make it not worthwhile. There's probably some sort of signalling equilibrium here, which is completely destroyed by the whole concept of accepting the money in secret. Hopefully actually-good organisations wouldn't do that sort of deontology violation and would just make their donor lists public?
If the hosts move first logically, then TDT will lead to the same outcomes as CDT, since it's in each host's interest to precommit to incentivising the human to pick their own box
It's in the hosts' interests to do that if they think the player is CDT, but it's not in their interests to commit to doing that. They don't lose anything by retaining the ability to select a better strategy later, after reading the player's mind.
Not whichever is lighter, one of whichever pair is heavier. Yes, I claim an EDT agent upon learning the rules, if they have a way to blind themself but not to force a commitment, will do this plan. They need to maximise the amount of incentive that the hosts have to put money in boxes, but to whatever extent they actually observe the money in boxes they will then expect to get the best outcome by making the best choice given their information. The only middle ground I could find is pairing the boxes, and picking one from the heavier pair. I'd be very happy to learn if I was wrong and there's a better plan or if this doesn't work for some reason.
EDT isn't precommitting to anything here; she does her opinion of the best choice at every step. That's still a valid complaint that it's unfair to give her a blindfold, though. If CDT found out about the rules of the game before the hosts made their predictions, he'd make himself an explosive collar that he can't take off and that automatically kills him unless he chooses the box with the least money, and get the same outcome as FDT, and EDT would do that as well. For the blindfold strategy EDT only needs to learn the rules before she sees what's in the boxes, and the technology required is much easier. I mostly wrote it this way because I think it's a much more interesting way to get stronger behaviour than the commitment-collar trick.
The hosts aren't competing with the human, only each other, so even if the hosts move first logically they have no reason or opportunity to try to dissuade the player from whatever they'd do otherwise. FDT is underdefined in zero-sum symmetrical strategy games against psychological twins, since it can foresee a draw no matter what, but choosing optimal strategy to get to the draw still seems better than playing dumb strategies on purpose and then still drawing anyway.
Why do you think they should be $100 and $200? Maybe you could try simulating it?
What happens if FDT tries to force all the incentives into one box? If the hosts know exactly what every other host will predict, what happens to their zero-sum competition and their incentive to coordinate with FDT?
Yes, that's the intended point, and probably a better way of phrasing it. I am concluding against the initial assertion, and claiming that it does make sense to trust people in some situations even though you're implementing a strategy that isn't completely immune to exploitation.
I don't consider the randomized response technique lying, it's mutually understood that their answer means "either I support X or both coins came up heads" or "either I support Y or both coins came up tails". There's no deception because you're not forming a false belief and you both know the precise meaning of what is communicated.
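For anyone who hasn't seen it, here's a minimal sketch of the two-coin scheme being described; the exact protocol and numbers are my own toy reconstruction of it:

```python
import random

def respond(supports_x: bool) -> str:
    """Two coins: both heads forces an 'X' answer, both tails forces a 'Y' answer,
    otherwise answer truthfully. Any single answer is deniable by construction."""
    c1, c2 = random.random() < 0.5, random.random() < 0.5
    if c1 and c2:
        return "X"
    if not c1 and not c2:
        return "Y"
    return "X" if supports_x else "Y"

def estimate_support(answers):
    """P(answer X) = 1/4 + (1/2) * p, so invert: p = 2 * observed_rate - 1/2."""
    observed = sum(a == "X" for a in answers) / len(answers)
    return 2 * observed - 0.5

true_rate = 0.3
answers = [respond(random.random() < true_rate) for _ in range(100_000)]
print(round(estimate_support(answers), 2))   # ~0.3, recovered without deceiving anyone
```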
I don't consider penetration testing lying, you know that penetration testers exist and have hired them. It's a permitted part of the cooperative system, in a way that actual scam artists aren't.
What's a word that means "antisocially deceive someone in a way that harms them and benefits you", such that everyone agrees it's a bad thing for people to be incentivised to do? I want to be using that word but don't know what it is.
Not sure what's unclear here? I mean that you'd generally prefer not to have incentive structures where you need true information from other people and they can benefit at your loss by giving you false information. Paying someone to lie to you means creating an incentive for them to actually deceive you, not merely giving them money to speak falsehoods.
A sufficiently strong world model can answer the question "What would a very smart very good person think about X?" and then you can just pipe that to the decision output, but that won't get you higher intelligence than what was present in the training material.
Shouldn't human goals have to be in the within-human-intelligence part, since humans have them? Or are we considering exactly-human-intelligence AI unsafe? Do you imagine a slightly dumber version of yourself failing to actualise your goals from not having good strategies, or failing to even embed them due to having a world model that lacks definitions of the objects you care about?
Corrigibility has to be in the reachable part of the goals because a well-trained dog genuinely wants to do what you want it to do, even if it doesn't always understand, and even if following the command will get it less food than otherwise and this is knowable to the dog. You clearly don't need human intelligence to describe the terminal goal "Do what the humans want me to do", although it's not clear the goal will stay there as intelligence rises above human intelligence.
You're far more likely to be a background character than the protagonist in any given story, so a theory claiming you're the most important person in a universe with an enormous number of people has an enormous rareness penalty to overcome before you should believe it, rather than believing you're just insane or being lied to. The rareness penalty for being in a utilitarian high-leverage position over the lives of billions can be overcome by reasonable evidence, but for the lives of 3^^^^3 people it is basically impossible to overcome. Even if the story is true, most of the observers will be witnessing it from the tied-to-the-track position, not holding the lever, so if you'd assign a low prior to being in the tied-to-the-track part of the story, you should assign an enormous factor lower to being in the decision-making part of it.
You can dodge it by having a bounded utility function, or, if you're utilitarian and good, a function that is at most linear in anthropic experience.
If the mugger says "give me your wallet and I'll cause you 3^^^^3 units of personal happiness" you can argue that's impossible because your personal happiness doesn't go that high.
If the mugger says "give me your wallet and I'll cause 1 unit of happiness to 3^^^^3 people who you altruistically care about" you can say that, in the possible world where he's telling the truth, there are 3^^^^3 + 1 people only one of which gets the offer and the others get the payout, so on priors it's at least 1/3^^^^3 against for you to experience recieving an offer, and you should consider it proportionally unlikely.
I don't think people realise how much astronomically more likely it is to truthfully be told "God created this paradise for you and your enormous circle of friends to reward an alien for giving him his wallet with zero valid reasoning whatsoever" than to be truthfully asked by that same Deity for your stuff in exchange for the distant unobservable happiness of countless strangers.
More generally, you can avoid most flavours of adversarial muggings with 2 rules: first don't make any trade that an ideal agent wouldn't make (because that's always some kind of money pump), and second don't make any trade that looks dumb. Not making trades can cost you in terms of missed opportunities, but you can't adversarially exploit the trading strategy of a rock with "no deal" written on it.
Hello, I read much of the background material over the past few years but am a new account. Not entirely sure what linked me here first, but my 70% guess is HPMoR. Compsci/mathematics background. I have mostly joined due to having ideas slightly too big for my own brain to check and wanting feedback, wanting to infect the community that I get a lot of my memes from with my own original memes, having now read enough to feel like LessWrong is occasionally incorrect about things where I can help, and wanting to improve my writing quality in ways that generalise to explaining things to other people.
If you instead say "evidence of", this makes more sense
Accepted and changed, I'm only claiming some information/entanglement, not absolute proof.
It applies to all transactions (because all transactions are fundamentally about trust)
Would it be clearer to say "markets with perfect information"? The problem I'm trying to describe can only occur with incomplete information / imperfect trust, but doesn't require so little information and trust that transactions become impossible in general. There's a wide middleground of imperfect trust where all of real life happens, and we still do business anyway.
And this is where you lose me. Failure to add value is not an externality. Good competition (offering a more attractive transaction) is not a market failure.
It sure looks like an externality when generally terrible things can happen as a result. I agree that being able to offer a better product is good, and being able to incentivise that is good if it can lead to more better products, but it does also have this side problem that can be harmful enough to be worth considering.
Signaling is a competitive/adversarial game.
Yeah, I know this idea isn't completely original / exists inside broader frameworks already, but I wanted to highlight it more specifically and I haven't found anything identical to this before. Thanks for the feedback.
Weighted by credence means you're scored on Probability*Prediction, which isn't a fair rule.
If I sincerely believe it's 60:40 between two options and I write that down, I expect 36+16=52 payout, but if I write 100:0 I expect 60+0=60 payout; putting more credence on higher-probability outcomes gets me more money, even in excess of my true beliefs.
Valid; I'm still working on properly writing up the version with full math, which is much more complicated. Without that math and without payment, it consists of people telling their beliefs and being mysteriously believed about them, because everyone knows everyone's incentivised to be honest and sincere, and the Agreement Theorem says that means they'll agree once they all know everyone else's reasoning.
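A quick check of that, next to one standard proper scoring rule (the Brier/quadratic score, chosen here just as a familiar contrast, not something from the thread):

```python
# Expected payout of reporting credence q when your true belief is p.

def linear_score(q, outcome):   # the "credence-weighted" rule criticised above
    return 100 * (q if outcome else 1 - q)

def brier_score(q, outcome):    # a proper rule: 100 * (1 - squared error)
    return 100 * (1 - ((1 if outcome else 0) - q) ** 2)

def expected(score, p, q):
    return p * score(q, True) + (1 - p) * score(q, False)

p = 0.6
print(expected(linear_score, p, 0.6), expected(linear_score, p, 1.0))  # 52.0 60.0 -> exaggerating pays
print(expected(brier_score, p, 0.6), expected(brier_score, p, 1.0))    # 76.0 60.0 -> honesty pays
```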
Possible Example that I think is the minimum case for any kind of market information system like this:
weather.com wants accurate predictions 7 days in advance for a list of measurements that will happen at weather measurement stations around the world, to inform its customers.
It proposes a naive prior, something like every measurement being a random sample from the past history.
It offers to pay $1 million in reward per single expected bit of information about the average sensor which it uses to assess the outcome, predicted before the weekly deadline. That means that if the rain-sensors are all currently estimated at a 10% chance of rain, and you move half of them to 15% and the other half to 5%, you should expect $1 million in profit for improving the rain predictions (conditional on your update actually being legitimate).
The market consists of many meteorologists looking at past data and other relevant information they can find elsewhere, and sharing the beliefs they reach about what the measurements will be in the future, in the form of a statistical model / distribution over possible sensor values. After making their own models, they can compare them and consider the ideas others thought of that they didn't, until, as the Agreement Theorem says, they should reach a common agreed prediction about the likelihood of combinations of outcomes.
How they reach agreement is up to them, but to prevent overconfidence you've got the threat that others can just bet against you and if you're wrong you'll lose, and to prevent underconfidence you've got the offer from the customer that they'll pay out for higher information predictions.
That distribution becomes the output of the information market, and the customer pays for the information in terms of how much information it contains over their naive prior, according to their agreed returns.
How payment works is basically that everyone is kept honest by being paid in carefully shaped bets designed to be profitable in expectation if their claims are true and losing in expectation if their claims are false or the information is made up. If the market knows you're making it up they can call your bluff before the prediction goes out by strongly betting against you, but there doesn't need to be another trader willing to do that: if the difference in prediction caused by you is not a step towards more accuracy, then your bet will on average lose and you'd be better off not playing.
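One way to shape such a bet, as a sketch under my own assumed settlement rule rather than a spec of this market: pay each trader the log-score improvement (in bits) of their update over the previous market prediction, evaluated on the realised outcome. Because the log score is proper, the expected payment equals KL(truth‖previous) minus KL(truth‖new), which is positive exactly when the update genuinely moved the prediction towards the true distribution and negative when it was made up, whether or not anyone else bets against it.

```python
import math

def settlement(prev_prob: float, new_prob: float, outcome: bool) -> float:
    """Pay the trader the log-score improvement (bits) their update made,
    judged on the realised outcome. Assumed rule, for illustration only."""
    p_prev = prev_prob if outcome else 1 - prev_prob
    p_new = new_prob if outcome else 1 - new_prob
    return math.log2(p_new) - math.log2(p_prev)

def expected_settlement(true_prob, prev_prob, new_prob):
    """Average the settlement over the real distribution of the outcome."""
    return (true_prob * settlement(prev_prob, new_prob, True)
            + (1 - true_prob) * settlement(prev_prob, new_prob, False))

# Market says 10% chance of rain at some sensor; reality is 15%.
print(expected_settlement(0.15, 0.10, 0.15))  # ~+0.018 bits: a genuine correction profits on average
print(expected_settlement(0.15, 0.10, 0.50))  # ~-0.37 bits: an invented update loses on average
```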
This is insanely high-risk for something like a single boolean market, where often your bet will lose by simple luck, but with a huge array of mostly uncorrelated features to predict anyone actually adding information can expect to win enough bets on average to get their earned profit.