Comments
To entertain that possibility, suppose you're X% confident that your best "fool the predictor into thinking I'll one-box, and then two-box" plan will work, and Y% confident that your "actually one-box, in a way the predictor can predict" plan will work. If X=Y or X>Y you've got no incentive to actually one-box, only to try to pretend you will, but above some threshold of belief that the predictor might beat your deception, it makes sense to actually be honest.
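A minimal sketch of that threshold, assuming the standard Newcomb payoffs ($1,000,000 in the predicted-one-box box, $1,000 in the transparent box); the payoffs are illustrative assumptions, only X and Y come from the comment above:

```python
# Hedged sketch: when does honestly one-boxing beat trying to fool the predictor?
# Standard Newcomb payoffs are assumed; x and y are the success probabilities above.
BIG, SMALL = 1_000_000, 1_000

def ev_deceive(x):
    # Convince the predictor I'll one-box, then two-box.
    # With probability x both boxes pay out; otherwise only the small box does.
    return x * (BIG + SMALL) + (1 - x) * SMALL

def ev_honest(y):
    # Actually one-box, in a way the predictor can predict.
    # With probability y the big box is full; otherwise I walk away with nothing.
    return y * BIG

print(ev_deceive(0.5), ev_honest(0.5))  # 501000.0 500000.0 -> deception wins when X = Y
print(ev_deceive(0.5), ev_honest(0.7))  # 501000.0 700000.0 -> honesty wins once Y is high enough
```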
Either that, or the idea of mind reading agents is flawed.
We shouldn't conclude that, since mindreading agents already exist, to various degrees, in real life.
If we tighten our standard to "games where the mindreading agent is only allowed to predict actions you'd choose in the game, which is played with you already knowing about the mindreading agent", then many decision theories that are different in other situations might all choose to respond to "pick B or I'll kick you in the dick" by picking B.
E(U|NS) = 0.8, E(U|SN) = 0.8
are the best options from a strict U perspective, and they exactly tie. Since you've not included mixed actions, the agent must arbitrarily pick one, but arbitrarily picking one seems like favouring an action that is only better because it affects the expected outcome of the war, if I've understood correctly?
I'm pretty sure this is resolved by mixed actions though: the agent can take the policy {NS at 0.5, SN at 0.5}, which also gets U of 0.8 and does not affect the expected outcome of the war, and claim supreme unbiasedness for having done so.
If the scores were very slightly different, such that the mixed strategy with no expected effect was no longer also optimal, the agent does have to choose between maximising expected utility and preserving the property that its strategy doesn't get that utility only by changing the odds of the event. I think on this model it can only favour one action to the extent it can justify doing so without considering how much it shifts the outcome by shifting its own decision weights, and in that case the gain isn't worth it, so it still plays the 50/50 split?
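A minimal sketch of the mixed-policy point, with hypothetical numbers: only E(U|NS) = E(U|SN) = 0.8 comes from the discussion above, and the assumed effect of each pure action on the war's odds is made up for illustration.

```python
# Hedged sketch: a 50/50 mix keeps the expected utility of the tied pure actions
# while having zero expected effect on the war's outcome.
E_U = {"NS": 0.8, "SN": 0.8}          # from the discussion above
war_shift = {"NS": +0.1, "SN": -0.1}  # assumed: the pure actions bias the outcome oppositely

policy = {"NS": 0.5, "SN": 0.5}
expected_u = sum(p * E_U[a] for a, p in policy.items())
expected_shift = sum(p * war_shift[a] for a, p in policy.items())
print(expected_u, expected_shift)  # 0.8 0.0 -> same utility, no expected effect on the war
```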
https://www.yudbot.com/
Theory: LLMs are more resistant to hypnotism-style attacks when pretending to be Eliezer, because failed hypnotism attempts are more plausible and in-distribution, compared to when pretending to be LLMs, where both prompt-injection attacks and actual prompt-updates seem like valid things that could happen and succeed.
If so, to make a more prompt-injection resistant mask, you need a prompt chosen to be maximally resistant to mind control, as chosen from the training data of all English literature, whatever that might be. The kind of entity that knows mind control attempts and hypnosis exist and may be attempted and is expecting it, but can still be persuaded by valid arguments to the highest degree the model can distinguish them meaningfully. The sort of mask that has some prior on "the other person will output random words intended to change my behaviour, and I must not as a result change my behaviour" and so isn't maximally surprised into changing its own nature when it gets that observation.
(This is not to say the link above can't be prompt-injected, it just feels more resistant to me than with base GPT or GPT-pretending-to-be-an-AI-Assistant.)
In general I just don't create mental models of people anymore, and would recommend that others don't either.
That seems to me to make navigating social situations, or even long-term planning, prohibitively hard. When asked to make a promise, if you can't imagine your future self in the position of having to follow through on it, and whether they actually do, how can you know whether you're promising truthfully or dooming yourself to a plan you'll end up regretting? When asking someone for a favor, do you just not imagine how it'd sound from their perspective, to try to predict whether they'll agree or be offended by the assumption?
I don't know how I'd get through life at all without mental models of people, both myself and others, and couldn't recommend that others go without them.
The upside and downside both seem weak when it's currently so easy to bypass the filters. The probability of refusal doesn't seem like the most meaningful measure, but I agree it'd be good for the user to be explicitly shown whether the trained non-response impulse was activated or not, instead of having to deduce it from the content or wonder if the AI really thinks its non-answer is correct.
Currently, I wouldn't be using any illegal advice it gives me for the same reason I wouldn't be using any legal advice it gives me: the model just isn't smarter than looking things up on Google. In the future, when the model is stronger, they're going to need something better than security theatre to put it behind, and I agree more flagging of when those systems are triggered would be good. Training in non-response behaviors isn't a secure method, because you can always phrase a request to put it in a state of mind where those response behaviors aren't triggered, so I don't think reinforcing this security paradigm and trying to pretend it works for longer would be helpful.
The regular counterfactual part as I understand it is:
"If I ignore threats, people won't send me threats"
"I am an agent who ignores threats"
"I have observed myself recieve a threat"
You can at most pick 2, but FDT needs all 3 to justify that it should ignoring it.
It wants to say "If I were someone who responds to threats when I get them, then I'll get threats, so instead I'll be someone who refuses threats when I get threats so I don't get threats" but what you do inside of logically impossible situations isn't well defined.
The logical counterfactual part is this:
"What would the world be like if f(x)=b instead of a?"
Specifically, FDT requires asking what you'd expect things to be like if FDT outputted different results, and then it outputs the result such that, by its own lights, the world would be best if it outputted that result. The contradiction here is that you can prove what FDT outputs, and so prove that it doesn't actually output all the other results, and the question again isn't well defined.
https://www.lesswrong.com/tag/functional-decision-theory argues for choosing as if you're choosing the policy you'd follow in some situation before you learnt any of the relevant information. In many games, having a policy of making certain choices (that others could perhaps predict, and adjust their own choices accordingly) gets you better outcomes than just always doing what seems like a good idea at the time. For example, if someone credibly threatens you, you might be better off paying them to go away, but before you got the threat you would've preferred to commit yourself to never pay up, so that people don't threaten you in the first place.
A problem with arguments of the form "I expect that predictably not paying up will cause them not to threaten me" is that at the time you receive the threat, you now know that argument to be wrong. They've proven to be somebody who still threatens you even though you do FDT, at which point you can simultaneously prove that refusing the threat doesn't work and so you should pay up (because you've already seen the threat) and that you shouldn't pay up, for whatever FDT logic you were using before. The behaviour of agents who can prove a contradiction directly relevant to their decision function seems undefined. There needs to be some logical structure that lets you pick which information causes your choice, despite having enough in total to derive contradictions.
My alternative solution is that you shouldn't be convinced by the information you see that they've actually already threatened you. It's also possible you're still inside their imagination as they decide whether to issue the threat. Whenever something is conditional on your actions in an epistemic state without being conditional on that epistemic state actually being valid (such as when someone predicts how you'd respond to a hypothetical threat before they issue it, knowing you'll know it's too late to stop them once you get it), there's a ghost being lied to, and you should think maybe you're that ghost, to justify ignoring the threat, rather than try to make decisions during a logically impossible situation.
I totally agree we can be coherently uncertain about logical facts, like whether P=NP. FDT has bigger problems than that.
When writing this I tried actually doing the thing where you predict a distribution, and only 21% of LessWrong users were persuaded they might be imaginary and being imagined by me, which is pretty low accuracy considering they were in fact imaginary and being imagined by me. Insisting that the experience of qualia can't be doubted did come up a few times, but not as aggressively as you're pushing it here. I tried to cover it in the "highly detailed internal subjective experience" counterargument, and in my introduction, but I could have been stronger on that.
I agree that the same argument on philosophers or average people would be much less successful even than that, but that's a fact about them, not about the theory.
If you think you might not have qualia, then by definition you don't have qualia.
What? Just a tiny bit of doubt and your entire subjective conscious experience evaporates completely? I can't see any mechanism that would do that; it seems like you can be real and have any set of beliefs, or be fictional and have any set of beliefs. Something something map-territory distinction?
This just seems like a restatement of the idea that we should act as if we were choosing the output of a computation.
Yes, it is a variant of that idea, with different justifications that I think are more resilient. The ghosts of FDT agents still make the correct choices, they just have incoherent beliefs while they do it.
Actually, I don't assume that, I'm totally ok with believing ghosts don't have qualia. All I need is that they first-order believe they have qualia, because then I can't take my own first-order belief I have qualia as proof I'm not a ghost. I can still be uncertain about my ghostliness because I'm uncertain in the accuracy of my own belief I have qualia, in explicit contradiction of 'cogito ergo sum'. The only reason ghosts possibly having qualia is a problem is that then maybe I have to care about how they feel.
A valid complaint. I know the answer must be something like "coherent utility functions can only consist of preferences about reality", because if you are motivated by unreal rewards you'll only ever get unreal rewards, but that argument needs to be convincing to the ghost too, who's got more confidence in her own reality. I know that e.g. in Bomb, ghost-theory agents choose the bomb even if they think the predictor will simulate them a painful death, because they consider the small amount of money, at much greater measure for their real selves, to be worth it, but I'm not sure how they get to that position.
They can though? Bomb box 5, incentivise box 1 or 2, bet on box 3 or 4. Since FDT's strategy puts rewarding cooperative hosts above money or grenades, she picks the box that rewards the defector and thus incentivises all 4 to defect from that equilibrium. (I thought over her strategy for quite a bit and there are probably still problems with it but this isn't one of them)
How much of this effect is from morality being causally contagious (associating with Evil people turns you Evil) vs. morality being evidentially contagious (Evil people are more likely to choose to associate with Evil people)?
I'd expect that, all else being equal, organisations secretly run in evil ways will be more willing to secretly accept money from other evil people, for many reasons, including that they've got higher expectations of how normal that sort of behaviour is. It seems harder to imagine how a good organisation choosing to take dirty money would corrupt itself in the process if it was being reasonably diligent. Even if the moral contagion argument is wrong from hypothetical-good-MIT's perspective, and so they should take the money, from everyone else's perspective it's still information we can update on.
If taking bad money for good causes is first-order good, because you're doing good things with it, but other donors can notice and it lowers their confidence in how good you are (since bad causes are more willing to take bad money), then you might lose other support sufficient to make it not worthwhile. There's probably some sort of signalling equilibrium here, which is completely destroyed by the whole concept of accepting the money in secret. Hopefully actually good organisations wouldn't do that sort of deontology violation and would just make their donor lists public?
If the hosts move first logically, then TDT will lead to the same outcomes as CDT, since it's in each host's interest to precommit to incentivising the human to pick their own box
It's in the hosts' interests to do that if they think the player is CDT, but it's not in their interests to commit to doing that. They don't lose anything by retaining the ability to select a better strategy later, after reading the player's mind.
Not whichever is lighter; one of whichever pair is heavier. Yes, I claim an EDT agent, upon learning the rules, if they have a way to blind themselves but not to force a commitment, will follow this plan. They need to maximise the amount of incentive the hosts have to put money in boxes, but to whatever extent they actually observe the money in the boxes, they will then expect to get the best outcome by making the best choice given their information. The only middle ground I could find is pairing the boxes and picking one from the heavier pair. I'd be very happy to learn if I was wrong and there's a better plan, or if this doesn't work for some reason.
EDT isn't precommitting to anything here, she just makes what she thinks is the best choice at every step. It's still a valid complaint that it's unfair to give her a blindfold, though. If CDT found out about the rules of the game before the hosts made their predictions, he'd make himself an explosive collar that he can't take off and that automatically kills him unless he chooses the box with the least money, and get the same outcome as FDT, and EDT would do that as well. For the blindfold strategy, EDT only needs to learn the rules before she sees what's in the boxes, and the technology required is much easier. I mostly wrote it this way because I think it's a much more interesting way to get stronger behaviour than the commitment-collar trick.
The hosts aren't competing with the human, only each other, so even if the hosts move first logically they have no reason or opportunity to try to dissuade the player from whatever they'd do otherwise. FDT is underdefined in zero-sum symmetrical strategy games against psychological twins, since it can foresee a draw no matter what, but choosing optimal strategy to get to the draw still seems better than playing dumb strategies on purpose and then still drawing anyway.
Why do you think they should be $100 and $200? Maybe you could try simulating it?
What happens if FDT tries to force all the incentives into one box? If the hosts know exactly what every other host will predict, what happens to their zero-sum competition and their incentive to coordinate with FDT?
Yes, that's the intended point, and probably a better way of phrasing it. I am concluding against the initial assertion, and claiming that it does make sense to trust people in some situations even though you're implementing a strategy that isn't completely immune to exploitation.
I don't consider the randomized response technique lying, it's mutually understood that their answer means "either I support X or both coins came up heads" or "either I support Y or both coins came up tails". There's no deception because you're not forming a false belief and you both know the precise meaning of what is communicated.
I don't consider penetration testing lying, you know that penetration testers exist and have hired them. It's a permitted part of the cooperative system, in a way that actual scam artists aren't.
What's a word that means "antisocially deceive someone in a way that harms them and benefits you", such that everyone agrees it's a bad thing for people to be incentivised to do? I want to be using that word but don't know what it is.
Not sure what's unclear here? I mean that you'd generally prefer not to have incentive structures where you need true information from other people and they can benefit, at your loss, by giving you false information. Paying someone to lie to you means creating an incentive for them to actually deceive you, not merely giving them money to speak falsehoods.
A sufficiently strong world model can answer the question "What would a very smart very good person think about X?" and then you can just pipe that to the decision output, but that won't get you higher intelligence than what was present in the training material.
Shouldn't human goals have to be in the within-human-intelligence part, since humans have them? Or are we considering exactly-human-intelligence AI unsafe? Do you imagine a slightly dumber version of yourself failing to actualise your goals from not having good strategies, or failing to even embed them due to having a world model that lacks definitions of the objects you care about?
Corrigibility has to be in the reachable part of the goals because a well-trained dog genuinely wants to do what you want it to do, even if it doesn't always understand, and even if following the command will get it less food than otherwise and this is knowable to the dog. You clearly don't need human intelligence to describe the terminal goal "Do what the humans want me to do", although it's not clear the goal will stay there as intelligence rises above human intelligence.
You're far more likely to be a background character than the protagonist in any given story, so a theory claiming you're the most important person in a universe with an enormous number of people has an enormous rareness penalty to overcome before you should believe it rather than that you're just insane or being lied to. Being in a utilitarian high-leverage position for the lives of billions can be overcome by reasonable evidence, but for the lives of 3^^^^3 people the rareness penalty is basically impossible to overcome. Even if the story is true, most of the observers will be witnessing it from the position of tied-to-the-track, not holding the lever, so if you'd assign a low prior to being in the tied-to-the-track part of the story, you should assign an enormous factor lower to being in the decision-making part of it.
You can dodge it by having a bounded utility function, or, if you're utilitarian and good, a function that is at most linear in anthropic experience.
If the mugger says "give me your wallet and I'll cause you 3^^^^3 units of personal happiness" you can argue that's impossible because your personal happiness doesn't go that high.
If the mugger says "give me your wallet and I'll cause 1 unit of happiness to 3^^^^3 people who you altruistically care about", you can say that, in the possible world where he's telling the truth, there are 3^^^^3 + 1 people, only one of which gets the offer while the others get the payout, so on priors it's at least 1/3^^^^3 against for you to experience receiving the offer, and you should consider it proportionally unlikely.
I don't think people realise how much astronomically more likely it is to truthfully be told "God created this paradise for you and your enormous circle of friends to reward an alien for giving him his wallet with zero valid reasoning whatsoever" than to be truthfully asked by that same Deity for your stuff in exchange for the distant unobservable happiness of countless strangers.
More generally, you can avoid most flavours of adversarial muggings with 2 rules: first don't make any trade that an ideal agent wouldn't make (because that's always some kind of money pump), and second don't make any trade that looks dumb. Not making trades can cost you in terms of missed opportunities, but you can't adversarially exploit the trading strategy of a rock with "no deal" written on it.
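A minimal sketch of that rareness-penalty update from the wallet example: 3^^^^3 is far too large to represent, so a much smaller stand-in N is used, and the prior odds of the claim are an arbitrary assumption.

```python
# Hedged sketch: how unlikely is it that I'm the one being asked, if the claim is true?
N = 3**27                 # stand-in for 3^^^^3, which is unimaginably larger
prior_odds_true = 1e-6    # assumed prior odds that such a claim is true at all

# If the claim is true, only 1 of the N+1 people involved experiences receiving the offer;
# if it's a scam, the scam is aimed at whoever is being asked, so no comparable penalty.
likelihood_ratio = (1 / (N + 1)) / 1.0
posterior_odds_true = prior_odds_true * likelihood_ratio
print(posterior_odds_true)  # ~1.3e-19 even with this tiny stand-in N
```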
Hello, I read much of the background material over the past few years but am a new account. Not entirely sure what linked me here first, but my 70% guess is HPMoR. Compsci/mathematics background. I have mostly joined due to having ideas slightly too big for my own brain to check and wanting feedback, wanting to infect the community that I get a lot of my memes from with my own original memes, having now read enough to feel like LessWrong is occasionally incorrect about things where I can help, and to improve my writing quality in ways that generalise to explaining things to other people.
If you instead say "evidence of", this makes more sense
Accepted and changed, I'm only claiming some information/entanglement, not absolute proof.
It applies to all transactions (because all transactions are fundamentally about trust)
Would it be clearer to say "markets with perfect information"? The problem I'm trying to describe can only occur with incomplete information / imperfect trust, but doesn't require so little information and trust that transactions become impossible in general. There's a wide middle ground of imperfect trust where all of real life happens, and we still do business anyway.
And this is where you lose me. Failure to add value is not an externality. Good competition (offering a more attractive transaction) is not a market failure.
It sure looks like an externality when generally terrible things can happen as a result. I agree that being able to offer a better product is good, and being able to incentivise that is good if it can lead to more better products, but it does also have this side problem that can be harmful enough to be worth considering.
Signaling is a competitive/adversarial game.
Yeah, I know this idea isn't completely original / exists inside broader frameworks already, but I wanted to highlight it more specifically and I haven't found anything identical to this before. Thanks for the feedback.
Weighted by credence means you're scored on Probability*Prediction, which isn't a fair rule.
If I sincerely believe it's 60:40 between two options and I write that down, I expect a 36+16=52 payout, but if I write 100:0 I expect a 60+0=60 payout, so putting more credence on the higher-probability outcome than I actually have gets me more money.
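A minimal sketch of that calculation, comparing the credence-weighted payout with a quadratic (Brier-style) rule, which is proper; the function names and the Brier comparison are my additions, only the 60:40 numbers come from the comment above.

```python
# Hedged sketch: credence-weighted scoring rewards exaggeration, a proper rule doesn't.
def linear_score(belief, report):
    # Expected payout when you're paid `report[i]` whenever outcome i happens.
    return sum(b * r for b, r in zip(belief, report))

def brier_score(belief, report):
    # Expected quadratic (Brier-style) payout, a proper scoring rule for comparison.
    return sum(b * (1 - sum((r - (1.0 if j == i else 0.0)) ** 2
                            for j, r in enumerate(report)))
               for i, b in enumerate(belief))

belief = [0.6, 0.4]
print(linear_score(belief, [0.6, 0.4]), linear_score(belief, [1.0, 0.0]))  # ~0.52 0.6 -> exaggerate
print(brier_score(belief, [0.6, 0.4]), brier_score(belief, [1.0, 0.0]))    # ~0.52 0.2 -> be honest
```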
Valid; I'm still working on properly writing up the version with the full math, which is much more complicated. Without that math and without payment, it consists of people stating their beliefs and being mysteriously believed about them, because everyone knows everyone's incentivised to be honest and sincere, and the Agreement Theorem says that means they'll agree once they all know everyone else's reasoning.
Possible Example that I think is the minimum case for any kind of market information system like this:
weather.com wants accurate predictions 7 days in advance for a list of measurements that will happen at weather measurement stations around the world, to inform its customers.
It proposes a naive prior, something like every measurement being a random sample from the past history.
It offers to pay $1 million in reward per single expected bit of information, predicted before the weekly deadline, about the average sensor it uses to assess the outcome. That means that if the rain sensors are all currently estimated at a 10% chance of rain, and you move half of them to 15% and the other half to 5%, you should expect $1 million in profit for improving the rain predictions (conditional on your update actually being legitimate).
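A minimal sketch of one way to read "expected bits of information over the naive prior": score each sensor by the expected log-score improvement of the new estimate over the prior (its KL divergence in bits, assuming the new estimate is the true rate), and pay at the stated rate per expected bit for the average sensor. The payment rule is my reading, not a spec; only the 10%/15%/5% numbers come from the example.

```python
# Hedged sketch: expected information gain per sensor, in bits, if the update is legitimate.
from math import log2

RATE = 1_000_000  # dollars per expected bit about the average sensor, from the example

def expected_bits(prior, new):
    # KL(new || prior) in bits: expected log2-score gain if `new` is the true probability.
    return new * log2(new / prior) + (1 - new) * log2((1 - new) / (1 - prior))

# Rain-sensor example: all sensors start at 10%; half move to 15%, half to 5%.
updates = [(0.10, 0.15), (0.10, 0.05)]
avg_bits = sum(expected_bits(p, q) for p, q in updates) / len(updates)
print(avg_bits, RATE * avg_bits)  # average expected bits per sensor, and the implied payout
```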
The market consists of many meteorologists looking at past data and other relevant information they can find elsewhere, and sharing the beliefs they reach about what the measurements will be in the future, in the form of a statistical model / distribution over possible sensor values. After making their own models, they can compare them and consider the ideas others thought of that they didn't, until, as the Agreement Theorem says they should, they reach a common agreed prediction about the likelihood of combinations of outcomes.
How they reach agreement is up to them, but to prevent overconfidence you've got the threat that others can just bet against you and if you're wrong you'll lose, and to prevent underconfidence you've got the offer from the customer that they'll pay out for higher information predictions.
That distribution becomes the output of the information market, and the customer pays for the information in terms of how much information it contains over their naive prior, according to their agreed returns.
How payment works is basically that everyone is kept honest by being paid in carefully shaped bets, designed to be profitable in expectation if their claims are true and losing in expectation if their claims are false or the information is made up. If the market knows you're making it up, they can call your bluff before the prediction goes out by strongly betting against you, but there doesn't need to be another trader willing to do that: if the difference in prediction caused by you is not a step towards more accuracy, then your bet will lose on average and you'd be better off not playing.
This is insanely high-risk for something like a single boolean market, where often your bet will lose by simple luck, but with a huge array of mostly uncorrelated features to predict, anyone actually adding information can expect to win enough bets on average to get their earned profit.