Thanks for clarifying. By "policy" and "standards" and "compelled speech" I thought you meant something more than community norms and customs. This is traditionally an important distinction to libertarians and free speech advocates. I think the distinction carves reality at the joints, and I hope you agree. I agree that community norms and customs can be unwelcoming.
As described, this type of event would not make me unrestrained in sharing my opinions.
The organizers have additional information regarding what opinions are in the bowl, so are probably in a position to determine which expressed opinions are genuinely held. This is perhaps solvable but it doesn't sound like an attempt was made to solve this. That's fine if I trust the organizers, but if I trust the organizers to know my opinions then I could just express my opinions to the organizers directly and I don't need this idea.
I find it unlikely that someone can pass an Ideological Turing Test for a random opinion that they read off a piece of paper a few sentences ago, especially compared to a genuine opinion they actually hold. It would be rather depressing if they could, because it implies that their genuine opinions have little grounding. An attendee could deliberately downplay their level of investment and knowledge to increase plausible deniability. But such conversations sound unappealing.
There are other problems. My guess is that most of the work was done by filtering for "a certain kind of person".
Besides, my appeal to authority trumps yours. Yes, they successfully lobbied the American legal system for the title of doctor - arguably this degrades the meaning of the word. Do you take physicians or the American legal system to be the higher authority on matters of health?
The AMA advocates for US physicians, so it has the obvious bias. Adam Smith:
People of the same trade seldom meet together, even for merriment and diversion, but the conversation ends in a conspiracy against the public, or in some contrivance to raise prices.
I do not consider the AMA an impartial authority on matters such as:
- Are chiropractors doctors?
- Can AIs give medical advice?
- How many new doctors should be trained in the US?
- Can nurses safely provide more types of medical care?
- How much should doctors be paid?
- How much training should new doctors receive?
- Should non-US doctors practice medicine in the US?
- Should US medical insurance cover medical treatments outside the US?
- Should we spend more on healthcare in general?
I therefore tend to hug the query and seek other evidence.
The example here is that I'm working for an NGO that opposes iodizing salt in developing countries because it is racist, for reasons. I've been reading online that it raises IQ and that raising IQ is good, actually. I want to discuss this in a safe space.
I can do this with any friends or family who don't work for the NGO. This seems more likely to work than attending a cancellation party at the NGO. If the NGO prevents me from having outside friends or talking to family then it's dangerous and I should get out regardless of its opinion on iodization.
There are better examples; I could offer suggestions if you like, and you can probably think of many yourself.
We can't reliably kill agents with the St Petersburg Paradox because if they keep winning we run out of resources and can no longer double their utility. This doesn't take long: the statistical value of a human life is in the millions of dollars, and doubling compounds very quickly.
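As a rough illustration (the starting stake and the world-wealth cap are my own assumed numbers, not from the original comment):

```python
# Rough illustration: how many utility doublings fit inside finite resources?
# The starting stake (~statistical value of a life) and the world-wealth cap
# are assumed numbers for the sketch.
stake = 10_000_000      # dollars
world_wealth = 5e14     # ~$500 trillion, a generous cap on what can be paid out

doublings = 0
while stake * 2 <= world_wealth:
    stake *= 2
    doublings += 1

print(doublings)  # ~25 doublings before the payer runs out of resources
```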
It's a stronger argument for Pascal's Mugging.
Gilliland's idea is that it is the proportion of trans people that dissuades some right-wing people from joining. That seems plausible to me, it matches the "Big Sort" thesis and my personal experience. I agree that his phrasing is unwelcoming.
I tried to find an official pronoun policy for LessWrong, LessOnline, EA Global, etc, and couldn't. If you're thinking of something specific could you say what? As well as the linked X thread I have read the X thread linked from Challenges to Yudkowsky's pronoun reform proposal. But these are the opinions of one person, they don't amount to politically-coded compelled speech. I'm not part of the rationalist community and this is a genuine question. Maybe such policies exist but are not advertised.
Edit: I apologize. Read in context Gilliland's comment about "keeping rat spaces clean" is referring to keeping them clean of racists, sexists, and fascists, not clean of right-wing people. I am striking the paragraph.
Them: The point of trade is that there are increasing marginal returns to production and diminishing marginal returns to consumption. We specialize in producing different goods, then trade to consume a diverse set of goods that maximizes utility.
Myself: Suppose there were no production possible, just some cosmic endowment of goods that are gradually consumed until everyone dies. Have we gotten rid of the point of trade?
Them: Well if people had different cosmic endowments then they would still trade to get a more balanced set to consume, due to diminishing marginal returns to consumption.
Myself: What if everyone has exactly the same cosmic endowment? And for good measure there are no diminishing returns, the tenth apple produces as much utility as the first.
Them: Well then there's no trade, what's the point? We just consume our cosmic endowment until we run out and die.
Myself: What if I like oranges more than apples, and you like apples more than oranges?
Them: Oh. I can trade one of my oranges for one of your apples, and we will both be better off. Darn it.
No, the effect size on bankruptcies is about 10x larger than expected. So while offline gambling may be comparable to alcohol, smartphone gambling is in a different category if we trust this research.
Of course some of those can be influenced by gambling, eg it is a type of overspending. Even so, Claude estimated that legalized online gambling would raise the bankruptcy rate by 2-3% and agreed that 28% is surprising.
The concept of marriage depends on my internals in that a different human might disagree about whether a couple is married, based on the relative weight they place on religious, legal, traditional, and common law conceptions of marriage. For example, after a Catholic annulment and a legal divorce, a Catholic priest might say that two people were never married, whereas I would say that they were. Similarly, I might say that two men are married to each other, and someone else might say that this is impossible. How quickly those arguments have faded away! I don't think someone would have used the same example ten years ago.
A potential big Model Delta in this conversation is between Yudkowsky-2022 and Yudkowsky-2024. From List of Lethalities:
The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
Vs the parent comment:
I think that the AI's internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn't surprise me to find distinct thoughts in there about electrons. As the internal ontology goes to be more about affordances and actions, I expect to find increasing disalignment. As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI's internals, I expect to find much larger differences -- not just that the AI has a different concept boundary around "easy to understand", say, but that it maybe doesn't have any such internal notion as "easy to understand" at all, because easiness isn't in the environment and the AI doesn't have any such thing as "effort". Maybe it's got categories around yieldingness to seven different categories of methods, and/or some general notion of "can predict at all / can't predict at all", but no general notion that maps onto human "easy to understand" -- though "easy to understand" is plausibly general-enough that I wouldn't be unsurprised to find a mapping after all.
Yudkowsky is "not particularly happy" with List of Lethalities, and this comment was made a day after the opening post, so neither quote should be considered a perfect expression of Yudkowsky's belief. In particular the second quote is more epistemically modest, which might be because it is part of a conversation rather than a self-described "individual rant". Still, the differences are stark. Is the AI utterly, incredibly alien "on a staggering scale", or does the AI have "noticeable alignments to human ontology"? Are the differences pervasive with "nothing that would translate well", or does it depend on whether the concepts are "purely predictive", about "affordances and actions", or have "reflective aspects"?
The second quote is also less lethal. Human-to-human comparisons seem instructive. A deaf human will have thoughts about electrons, but their internal ontology around affordances and actions will be less aligned. Someone like Eliezer Yudkowsky has the skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment, whereas I can't do that because I project the category boundary onto the environment. Someone with dissociative identities may not have a general notion that maps onto my "myself". Someone who is enlightened may not have a general notion that maps onto my "I want". And so forth.
Regardless, different ontologies is still a clear risk factor. The second quote still modestly allows the possibility of a mind so utterly alien that it doesn't have thoughts about electrons. And there are 42 other lethalities in the list. Security mindset says that risk factors can combine in unexpected ways and kill you.
I'm not sure if this is an update from Yudkowsky-2022 to Yudkowsky-2024. I might expect an update to be flagged as such (eg "I now think that..." instead of "I think that..."). But Yudkowsky said elsewhere that he has made some positive updates. I'm curious if this is one of them.
Naming: I've more commonly heard "anvil problem" to refer to an exploring agent that doesn't understand that it is part of the environment it is exploring and therefore "drops an anvil on its own head". See anvil problem tag for more.
Let's expand on this line of argument and look at your example of bee waggle-dances. You question whether the abstractions represented by the various dances are natural. I agree! Using a Cartesian frame that treats bees and humans as separate agents, not part of Nature, they are not Natural Abstractions. With an Embedded frame they are a Natural Abstraction for anyone seeking to understand bees, but in a trivial way. As you say, "one of the systems explicitly values and works towards understanding the abstractions the other system is using".
Also, the meter is not a natural abstraction, which we can see by observing other cultures using yards, cubits, and stadia. If we re-ran cultural evolution, we'd expect to see different measurements of distance chosen. The Natural Abstraction isn't the meter, it's Distance. Related concepts like relative distance are also Natural Abstractions. If we re-ran cultural evolution, we would still think that trees are taller than grass.
I'm not a bee expert, but Wikipedia says:
In the case of Apis mellifera ligustica, the round dance is performed until the resource is about 10 meters away from the hive, transitional dances are performed when the resource is at a distance of 20 to 30 meters away from the hive, and finally, when it is located at distances greater than 40 meters from the hive, the waggle dance is performed
The dance doesn't actually mean "greater than 40 meters", because bees don't use the metric system. There is some distance, the Waggle Distance, where bees switch from a transitional dance to a waggle dance. Claude says, with low confidence, that the Waggle Distance varies based on energy expenditure. In strong winds, the Waggle Distance goes down.
Humans also have ways of communicating energy expenditure or effort. I don't know enough about bees or humans to know if there is a shared abstraction of Effort here. It may be that the Waggle Distance is bee-specific. And that's an important limitation on the NAH: it says, as you quote, "there exist abstractions which are natural", but I think we should also believe the Artificial Abstraction Hypothesis, which says that there exist abstractions which are not natural.
This confusion is on display in the discussion around My AI Model Delta Compared To Yudkowsky, where Yudkowsky is quoted as apparently rejecting the NAH:
The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
But then in a comment on that post he appears to partially endorse the NAH:
I think that the AI's internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn't surprise me to find distinct thoughts in there about electrons.
But also endorses the AAH:
As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI's internals, I expect to find much larger differences -- not just that the AI has a different concept boundary around "easy to understand", say, but that it maybe doesn't have any such internal notion as "easy to understand" at all, because easiness isn't in the environment and the AI doesn't have any such thing as "effort".
I appreciate the brevity of the title as it stands. It's normal for a title to summarize the thesis of a post or paper and this is also standard practice on LessWrong. For example:
- The sun is big but superintelligences will not spare the Earth a little sunlight.
- The point of trade
- There's no fire alarm for AGI
The introductory paragraphs sufficiently described the epistemic status of the author for my purposes. Overall, I found the post easier to engage with because it made its arguments without hedging.
I appreciate the clarity of the pixel game as a concrete thought experiment. Its clarity makes it easier for me to see where I disagree with your understanding of the Natural Abstraction Hypothesis.
The Natural Abstraction Hypothesis is about the abstractions available in Nature, that is to say, the environment. So we have to decide where to draw the boundary around Nature. Options:
- Nature is just the pixel game itself (Cartesian)
- Nature is the pixel game and the agent(s) flipping pixels (Embedded)
- Nature is the pixel game and the utility function(s) but not the decision algorithms (Hybrid)
In the Cartesian frame, "top half", "bottom half", "outer rim", and "middle square" are all Unnatural Abstractions, because they're not in Nature, they're in the utility functions.
In the Hybrid and Embedded frames, when System A is playing the game, then "top half" and "bottom half" are Natural Abstractions, but "outer rim" and "middle square" are not. The opposite is true when System B is playing the game.
Let's make this a multi-player game, and have both systems playing on the same board. In that case all of "top half", "bottom half", "outer rim", and "middle square" are Natural Abstractions. We expect system A to learn "outer rim" and "middle square" as it needs to predict the actions of system B, at least given sufficient learning capabilities. I think this is a clean counter-example to your claim:
Two systems require similar utility functions in order to converge on similar abstractions.
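A minimal sketch of the multi-player version (grid size, utility definitions, and names are my own concrete choices, not from the post):

```python
import numpy as np

# Illustrative multi-player pixel game: grid size and utility definitions are
# assumed here for concreteness.
N = 8
board = np.zeros((N, N), dtype=int)  # each pixel is 0 or 1

def utility_A(b):
    """System A cares about the top half versus the bottom half."""
    return b[: N // 2].sum() - b[N // 2 :].sum()

def utility_B(b):
    """System B cares about the outer rim versus the middle square."""
    middle = b[1:-1, 1:-1].sum()
    rim = b.sum() - middle
    return rim - middle

# When both systems flip pixels on the same board, system A must predict which
# pixels system B will flip next, so "outer rim" and "middle square" become
# useful abstractions for A even though they appear nowhere in A's utility.
```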
BLUF: The cited paper doesn't support the claim that we change our minds less often than we think, and overall it and a paper it cites point the other way. A better claim is that we change our minds less often than we should.
The cited paper is freely downloadable: The weighing of evidence and the determinants of confidence. Here is the sentence immediately following the quote:
It is noteworthy that there are situations in which people exhibit overconfidence even in predicting their own behavior (Vallone, Griffin, Lin, & Ross, 1990). The key variable, therefore, is not the target of prediction (self versus other) but rather the relation between the strength and the weight of the available evidence.
The citation is to Vallone, R. P., Griffin, D. W., Lin, S., & Ross, L. (1990). Overconfident Prediction of Future Actions and Outcomes by Self and Others. Journal of Personality and Social Psychology, 58, 582-592.
Self-predictions are predictions
Occam's Razor says that our mainline prior should be that self-predictions behave like other predictions. These are old papers and include a small number of small studies, so probably they don't shift beliefs all that much. However much you weigh them, I think they weigh in favor of Occam's Razor.
In Vallone 1990, 92 students were asked to predict their future actions later in the academic year, and those of their roommate. An example prediction: will you go to the beach? The greater time between prediction and result makes this a more challenging self-prediction. Students were 78.7% confident and 69.1% accurate for self-prediction, compared to 77.4% confident and 66.3% accurate for other-prediction. Perhaps evidence for "we change our minds more often than we think".
I think more striking is that both self and other predictions had a similar 10% overconfidence. They also had similar patterns of overconfidence - the overconfidence was clearest when it went against the base rate, and students underweighted the base rate when making both self-predictions and other-predictions.
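A quick check of those calibration gaps, using nothing beyond the figures quoted above:

```python
# Overconfidence = stated confidence minus accuracy, using the Vallone et al.
# (1990) figures quoted above.
self_conf, self_acc = 0.787, 0.691
other_conf, other_acc = 0.774, 0.663

print(f"self-prediction overconfidence:  {self_conf - self_acc:.1%}")   # ~9.6%
print(f"other-prediction overconfidence: {other_conf - other_acc:.1%}")  # ~11.1%
```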
As well as Occam's Razor, self-predictions are inescapably also predicting other future events. Consider the job offer case study. Will one of the employers increase the compensation during negotiation? What will they find out when they research the job locations? What advice will they receive from their friends and family? Conversely, many other-predictions are entangled with self-predictions. It's hard to conceive how we could be underconfident in self-prediction, overconfident in other-prediction, and not notice when the two biases clash.
Short-term self-predictions are easier
In Griffin 1992, the first test of "self vs other" calibration is study 4. This is a set of cooperate/defect tasks where the 24 players predict their future actions and their partner's future actions. They were 84% confident and 81% accurate in self-prediction but 83% confident and 68% accurate in other-prediction. So they were well-calibrated for self-prediction, and over-confident for other-prediction. Perhaps evidence for "we change our minds as often as we think".
But self-prediction in this game is much, much easier than other-prediction. 81% accuracy is surprisingly low - I guess that players were choosing a non-deterministic strategy (eg, defect 20% of the time) or were choosing to defect based in part on seeing their partner. But I have a much better idea of whether I am going to cooperate or defect in a game like that, because I know myself a little, and I know other people less.
The next study in Griffin 1992 is a deliberate test of the impacts of difficulty on calibration, where they find:
A comparison of Figs. 6 and 7 reveals that our simple chance model reproduces the pattern of results observed by Lichtenstein & Fischhoff (1977): slight underconfidence for very easy items, consistent overconfidence for difficult items, and dramatic overconfidence for “impossible” items.
Self-predictions are not self-recall
If someone says "we change our minds less often than we think", they could mean one or more of:
- We change our minds less often than we predict that we will
- We change our minds less often than we model that we do
- We change our minds less often than we recall that we did
If an agent has a bad self-model, it will make bad self-predictions (unless its mistakes cancel out). If an agent has bad self-recall it will build a bad self-model (unless it builds its self-model iteratively). But if an agent makes bad self-predictions, we can't say anything about its self-model or self-recall, because all the bugs can be in its prediction engine.
Instead, Trapped Priors
This post precedes the excellent advice to Hold Off on Proposing Solutions. But the correct basis for that advice is not that "we change our minds less often than we think". Rather, what we need to solve is that we change our minds less often than we should.
In Trapped Priors as a basic problem of rationality, Scott Alexander explains one model for how we can become stuck with inaccurate beliefs and find it difficult to change our beliefs. In these examples, the person with the trapped prior also believes that they are unlikely to change their beliefs.
- The person who has a phobia of dogs believes that they will continue to be scared of dogs.
- The Republican who thinks Democrats can't be trusted believes that they will continue to distrust Democrats.
- The opponent of capital punishment believes that they will continue to oppose capital punishment.
Reflections
I took this post on faith when I first read it, and found it useful. Then I realized that, just from the quote, the claimed study doesn't support the post, people considering two job offers are not "within half a second of hearing the question". It was that confusion that pushed me to download the paper. I was surprised to find the Vallone citation that led me to draw the opposite conclusion. I'm not quite sure what happened in October 2007 (and "on August 1st, 2003, at around 3 o’clock in the afternoon"). Still, the sequence continues to stand with one word changed from "think" to "should".
There are several relevant differences. It's very difficult to spend very large amounts on Taylor Swift tickets while concealing it from your family and friends. There is no promise of potentially winning money by buying Taylor Swift tickets. Spending more money on Taylor Swift tickets gets you more or better entertainment. There is a lower rate of regret by people who spend money on Taylor Swift tickets. Taylor Swift doesn't make most of her money from a small minority of super whales.
I scored 2200-ish with casual phone play, including repeatedly pressing the wrong button by accident. I'm guessing better play should get someone up to 4,000 or so.
Given the setup I was sad there wasn't an explicit target or outcome in terms of how much food was needed to get home safely. I think also a more phone-friendly design would have been nice.
Thanks for making the game!
Suppose that we have a truly "quality-adjusted" QALY measure, where time spent working "at jobs where they have to smile and bear it when their bosses abuse them" counts as zero, alongside other unpleasant but necessary tasks. We also count time spent sleeping as zero. It might be clearer to label this measure as "quality hours". (Maybe we count especially good times as double or triple, and this helps us understand people working hard to earn enough for a vacation or wedding or some other memorable experience)
In this model we could define absolute poverty based on the absolute number of quality hours per year. Maybe we set an arbitrary threshold at 100 quality hours per year. If a hypothetical medieval peasant is working every hour they are awake, except that their lord gives them Christmas off, they have 8 quality hours per year and are in poverty. If a poor Anoxian spends all their non-work time sleeping because of the low oxygen supply, except for an hour a week reading books with their kids, they have 52 quality hours per year and are in poverty.
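A minimal sketch of that bookkeeping (the helper function and the hour breakdown for the two examples are my own assumptions, chosen to match the figures above):

```python
# Sketch of the "quality hours" tally described above. The 8-hour Christmas
# figure and the 52 weekly reading hours are assumptions matching the text.
POVERTY_THRESHOLD = 100  # quality hours per year (the arbitrary threshold above)

def quality_hours(ordinary_hours, especially_good_hours=0.0):
    """Waking, non-work, non-drudgery hours; especially good time counted double."""
    return ordinary_hours + 2 * especially_good_hours

peasant = quality_hours(ordinary_hours=8)    # Christmas off: ~8 waking leisure hours
anoxian = quality_hours(ordinary_hours=52)   # one hour a week reading with the kids

for name, qh in [("peasant", peasant), ("Anoxian", anoxian)]:
    status = "in poverty" if qh < POVERTY_THRESHOLD else "not in poverty"
    print(f"{name}: {qh} quality hours/year, {status}")
```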
This type of measurement wouldn't have the same distorted effects of partial abundance, compared to the $/day metric that is commonly used. I think it would still show significant progress in quality hours, with extended childhood, longer retirement, and labor-saving devices. I think UBI experiments would likely continue to show improvements when measured with quality hours.
It predicts phobias, partisanship, and stereotypes. It doesn't predict generalized stupidity.
Maybe you think this model predicts more phobias than we actually see?
The cited paper is freely downloadable: The weighing of evidence and the determinants of confidence. Here is the rest of the quoted paragraph:
It is noteworthy that there are situations in which people exhibit overconfidence even in predicting their own behavior (Vallone, Griffin, Lin, & Ross, 1990). The key variable, therefore, is not the target of prediction (self versus other) but rather the relation between the strength and the weight of the available evidence.
This cuts against the conclusion drawn by Yudkowsky 2007. The Vallone citation is to this freely downloadable paper:
Vallone, R. P., Griffin, D. W., Lin, S., & Ross, L. (1990). Overconfident Prediction of Future Actions and Outcomes by Self and Others. Journal of Personality and Social Psychology, 58, 582-592.
To your specific question of "how often people change their mind", if you download the paper you'll see that there were a variety of self-prediction topics. Accuracy varied but I'd gloss it as "about 70%", meaning that the subjects changed their mind about 30% of the time. For questions where the subjects predicted their future actions with 90-100% confidence, they changed their mind about 25% of the time.
So, we change our minds more often than we think.
Typically, a salaried white collar worker can turn up to work and use the bathroom at the start of the day, and it is counted as working hours, whereas a blue collar worker will use the bathroom before starting work (for the reasons you give about KPIs) and so it is not counted as working hours. Similarly for lunch break and end of shift. As a result the white collar worker will have a larger proportion of bathroom time counted as "working hours", given the same time spent in the bathroom.
Maybe your point is that this is a difference of degree, not a difference in kind? True, but differences of degree matter for the working hour trends being discussed. If measured working hours stay the same but workers spend more of their bathroom hours during working hours then this is an effective increase in free time.
I read Nickel and Dimed (2001) several years ago and I thought it was very good. A couple of things I remember that are relevant to the discussion.
-
Ehrenreich did not find a shortage of part-time work. My recollection is that the problem was the opposite: employers would only offer up to 30 hours of work a week, for regulatory reasons. So Ehrenreich often had to pick up two such jobs to attempt to earn enough money, which increased her costs. I agree that non-linear compensation is common at higher income levels, especially in knowledge work where there are increasing returns to marginal labor.
-
Ehrenreich discussed with her fellow employees how they were making ends meet. A common answer was that they lived with relatives, friends, or partners, allowing them to save money on housing, food, and transit, relative to Ehrenreich and also giving them a small safety net. From the perspective of Ehrenreich's co-workers, she was paying extra to live by herself. She failed to make ends meet largely for that reason.
I have a prediction market for this. There are papers in the description, which I review in the comments.
I think it's important to be able to make a narrow point about outer alignment without needing to defend a broader thesis about the entire alignment problem.
Indeed. For it is written:
A mind that ever wishes to learn anything complicated, must learn to cultivate an interest in which particular exact argument steps are valid, apart from whether you yet agree or disagree with the final conclusion, because only in this way can you sort through all the arguments and finally sum them.
For more on this topic see "Local Validity as a Key to Sanity and Civilization."
If I knew a charity had murdered ten people I would report the charity to the appropriate authorities. I wouldn't donate to the charity because that would make me an accessory to murder.
The medical profession supports medical treatments that save lives but very occasionally have lethal side effects. I defer to their judgement but it makes sense to me.
I've not read the paper but something like https://arxiv.org/html/2402.19167v1 seems like the appropriate experiment.
There is a trivial solution to resolving Pascal's mugging using classical decision theory (accept the objective definition of probability; once you do so, the probability of me carrying out my threat becomes zero and the problem disappears).
The value of the threat becomes zero times infinity, and so undefined. This definitely improves the situation, but I'm not sure it's a full solution.
Evidence from wartime rationing is that given a per-buyer cap, people are less angry about price rises. Or perhaps war creates more solidarity than natural disasters.
Maybe a compromise is possible where merchants are allowed to raise the price provided that they have a per-buyer cap and sell their stock quickly.
Disclaimer: I am not arguing for or against "Santa-ism". Policy Debates Should Not Appear One-Sided. I am instead interested in parenting in general.
Likewise, offering children bribes for good behavior encourages the children to behave well only when adults are watching, while praise without bribes leads to unconditional good behavior.
Parenting tools to shape behavior
Looking at this space of artificial positive rewards to shape behavior, tools include attention, praise, bribes, payments, hugs, gifts, star charts, treats, and more. They all work through similar mechanisms:
- They reinforce behavior; "reward chisels cognitive grooves into an agent" (from Reward is not the optimization target).
- Reinforcement propagates backwards through credit assignment. If a child has a tantrum and is offered a reward to stop, and would not have received the reward without the tantrum, then the tantrum is reinforced.
- Separately, the strategy of offering rewards to shape behavior is passed on via imitation. The child who is rewarded with gifts is more likely to try rewarding others with gifts. These strategies are then reinforced, or not, based on results.
This is not special to human parenting, we see it in other animals and in non-parenting contexts. It doesn't give us much reason to expect bribes to result in deception and praise to result in "unconditional good behavior".
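A toy sketch of the credit-assignment point from the list above (the behaviors, propensities, and update rule are illustrative assumptions, not a model from the literature):

```python
import random

# Toy credit assignment: whichever behavior immediately precedes the reward
# gets reinforced, whether or not that is the behavior the parent wanted.
propensity = {"ask politely": 1.0, "tantrum": 1.0}
LEARNING_RATE = 0.5

def episode(parent_rewards_tantrum_to_stop: bool) -> None:
    behavior = random.choices(list(propensity), weights=propensity.values())[0]
    rewarded = (behavior == "tantrum") == parent_rewards_tantrum_to_stop
    if rewarded:
        propensity[behavior] += LEARNING_RATE  # "reward chisels cognitive grooves"

for _ in range(100):
    episode(parent_rewards_tantrum_to_stop=True)

print(propensity)  # the "tantrum" propensity grows when tantrums are what get rewarded
```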
Value of parenting research
While there is research comparing specific parenting tools, it's weak evidence:
- Observational studies have lots of confounding factors.
- Children vary in how they respond to parenting tools.
- Parents vary in their comfort and skill with parenting tools.
- It's hard to measure how people act when they are not measured.
- RCTs are limited, we can't force parents to use specific tools more or less often.
- There are biases in the research process and many results don't replicate.
Such evidence as I've seen filtered through books and other sources suggests that the type of praise can help direct the credit assignment process. So we get the advice to praise the specific behavior we want to reinforce.
I wouldn't be surprised (20%) if this fails to replicate, but I give it some influence. If so, praise is a weaker but more targetable reinforcer, whereas (eg) candy is a stronger but less targeted reinforcer.
Does Santa bring bribes or praise?
Gifts, obviously. But also praise. Santa brings gifts to "good children", so if a kid gets a gift from Santa then it follows they are a "good child", and that is praise. And if a child eats their vegetables on Christmas Eve and their parents say "Santa will be happy you're eating healthy" then that's specific praise.
What about The Third Alternative?
As hinted above, parents use a lot of tools to raise their children. There are multiple forms of artificial positive rewards and multiple ways to use them, and that is just one general category among many. The tools aren't mutually exclusive. There is no shortage of alternatives, including wild and creative brainstormed options.
So instead of choosing between Alternative One and Alternative Two in a False Dilemma, the actual problem parents face is choosing a portfolio of tools based on their situation, trying to find the Pareto frontier given their goals and values.
(1) is not an infohazard because it is too obvious. The generals noticed it instantly, judging from the top of the diplomatic channel. (2) is relatively obvious. It appears to me that the generals noticed it instantly, though the first specific reference to private messages comes later. These principles are learned at school age. Making them common knowledge, known to be known, allows collaboration based on that common knowledge, and collaboration is how y'all avoided getting nuked.
To the extent that (3) is true, it would be prevented by common knowledge of (2). Also I think it's false, a general can avoid Unilateralist's Curse here by listening to what other people say (in war room, diplomatic channel, and public discussion) and weighing that fairly before acting, potentially getting advice from family and friends. Probably this is the type of concern that can be defused by making it public. It would be bad if a general privately believed (3) and therefore nuked unilaterally.
(4) is too vague for my purposes here.
I agree that "I'm a general and I don't know my launch code" is a possible infohazard if posted publicly. I would have shared the knowledge with my team to reduce the risk of reduced deterrence in the possible world where LessWrong admins mistakenly only sent launch codes to one side, taking note of (1) and (2) in how I shared it.
I don't think this is relevant to real-world infohazards, but I think it's relevant to building and testing transferrable infohazard skills. People who believe they have been or will be exposed to existential infohazards should build and test their skills in safer environments.
Please do not change the title. You have used the phrase correctly from both a prescriptive and a descriptive approach to language. A title such as "Shutting Down all Competing AI Projects is not Actually a Pivotal Act" would be an incorrect usage and increase confusion.
Something which might not buy ample time can still be a pivotal act. From the Arbital page that you link to:
Example 3: Suppose a behaviorist genie is restricted from modeling human minds in any great detail, but is still able to build and deploy molecular nanotechnology. Moreover, the AI is able to understand the instruction, "Build a device for scanning human brains and running them at high speed with minimum simulation error", and is able to work out a way to do this without simulating whole human brains as test cases. The genie is then used to upload a set of, say, fifty human researchers, and run them at 10,000-to-1 speeds.
This accomplishment would not of itself save the world or destroy it - the researchers inside the simulation would still need to solve the alignment problem, and might not succeed in doing so.
But it would (positively) upset the gameboard and change the major determinants of winning, compared to the default scenario where the fifty researchers are in an equal-speed arms race with the rest of the world, and don't have practically-unlimited time to check their work. The event where the genie was used to upload the researchers and run them at high speeds would be a critical event, a hinge where the optimum strategy was drastically different before versus after that pivotal act.
The Limited AI (LAI) scenario in this post is equivalent to this example and therefore qualifies as a Pivotal Act under the Arbital Guarded Definition. Additionally, looking at your specific quote, the LAI would "drastically increase the probability of a win".
Note for example that if we live in such a simulation that I describe in my proposal, which I give a good probability to, then it's not us, but the AI that is being tested, and it's unclear if the gods expect anything from us.
If we are in a simulation such that some agents in the simulation are being rewarded for achieving some objective X, then it is more likely that many agents are being rewarded, rather than only one. In your proposal, X is approximately "Humans are safe and the Simulation Hypothesis is not tested". This obviously has implications for simulated humans. You are better placed to decide the implications, given that you know the values of X you believe in.
Rationale
The cost of running a simulation-reward pair comprises Base Cost and Bribe Cost. The Base Cost is the cost of simulating a universe, plus the cost of ethically treating all simulant participants after the simulation ends. The Bribe Cost is the cost of rewarding simulant participants, above the ethically required minimum, after the simulation ends. The benefit is the Bribe Value, which accrues in proportion to the degree that simulant participants behave differently due to possible post-simulation rewards.
For it to make sense to run a simulation in which only the AI is rewarded, the benefit from influencing the AI must be greater than the total cost, but the benefit from influencing any other agent must be less than the Bribe Cost for that agent. But these things are closely linked. More powerful agents have higher Bribe Cost and higher Bribe Value.
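Written out as inequalities (the notation is mine), the condition for it to pay to reward only the AI is:

```latex
% Let B = Base Cost, c_i = Bribe Cost of agent i, v_i = Bribe Value of agent i.
% Rewarding only the AI makes sense when:
v_{\mathrm{AI}} > B + c_{\mathrm{AI}}
\quad\text{and}\quad
v_i < c_i \ \text{for every other agent } i .
```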
This might break down for very low-powered agents, like fish, which have zero Bribe Value. Probably we don't simulate low-powered agents at sufficient fidelity for them to be conscious, for ethical reasons, and to reduce Base Cost. If you are reading this and you are conscious then either you are in base reality, or you are in a simulation where the simulators decided your consciousness was worth simulating.
I'm interested in what Bayes Factor you associated with each of the missile counts. It seems like a hard problem, given that the actual missile counts were retrieved from an array of indeterminate size with indeterminate values, and given that you did not know the missile capabilities of the opposing side, nor did you know the sensor error rate. Petrov knew that the US would not launch only five missiles, but nobody knows how many missiles were fielded by East Wrong, including the generals of East Wrong.
We don't even know if the missile counts were generated by some plausible non-deterministic model or just the game-makers throwing some numbers in a file. Maybe even deliberately including a large number or two in the no-missile array to try to fake out the Petrov players. All we know is that the numbers are "weighted to the higher end if nuclear war has actually begun". All these things make me think that the missile counts should be a small probability update.
Partly as a result, for gaining karma, I think the optimal strategy is to always report All Clear. There will be 1-7 occasions to report, and at most only one occasion can have Incoming Missiles. Each hour we start with a low base rate of Incoming Missiles and the "random" number generator can't overcome this to >40% because of the issues above. Also, wrongly reporting Incoming Missiles reduces the expected duration of the game, so it has a higher effective penalty. So always report All Clear.
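A hedged worked example of why the counts shouldn't move you past 40% (the prior and the Bayes factors are my own guesses, not values from the game):

```python
# How strong would the missile-count evidence need to be to push a low hourly
# base rate above 40%? Prior and Bayes factors below are illustrative guesses.
def posterior(prior, bayes_factor):
    odds = (prior / (1 - prior)) * bayes_factor
    return odds / (1 + odds)

prior = 0.05  # assumed low hourly base rate of Incoming Missiles
for bf in (2, 5, 13):
    print(f"Bayes factor {bf:>2}: posterior = {posterior(prior, bf):.0%}")
# A Bayes factor of 5 only reaches ~21%; it takes roughly 13:1 evidence
# to cross 40% from a 5% prior.
```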
Switch 2020 & 2021. In 2022 it went down three times.
- 2019: site did not go down. See Follow-Up to Petrov Day, 2019.
- 2020: site went down. See On Destroying the World.
- 2021: site did not go down. See Petrov Day Retrospective 2021.
- 2022: site went down three times. See Petrov Day Retrospective 2022.
- 2023: site did not go down. See Petrov Day Retrospective 2023.
- 2024: site went down.
When the US beat Japan with nuclear weapons it was not immediately bad for the US. It led to the Cold War, which carries an extinction risk, but it's not clear if that was avoidable. I'm curious what you're thinking of here.
Re peace = mutual destruction.
I spent some time reading the board game geek discussion threads about cohabitation games here. There was a group of writers there who judged game success in relative terms and didn't distinguish "everyone wins" from "everyone loses" in games. I wonder if that's going on here.
Could also have been a typo...
Since the game is over perhaps you can share? This could be good practice in evaluating infohazard skills.
If a user hasn't posted substantial in-group criticism, are they a rationalist?
Of course nuclear warfare is an escalation from posting critiques.
Mana is no longer exchangeable for directed charitable donations, so I would not expect generals to make their decisions based on the potential to win mana. Mana seems less valuable than karma, for example. This concern came up in 2022 and my writeup that year suggested it was unlikely to be a factor.
Someone else created it.
This can be countered: the emperor can occasionally take the jester's side, and the jester can hide serious views behind a mask of silliness.
It's not enough for P(H2) to be in the same order of magnitude as P(H1), it needs to be high enough that the AI should rationally abandon epistemic rationality. I think that's pretty high, maybe 10%. You've not said what your P(H1) is.
Tabooing theories of human value then. It's better to be a happy pet than to be dead.
Maybe Value Is Fragile among some dimensions, such that the universe has zero value if it lacks that one thing. But Living By Your Own Strength, for example, is not one of those dimensions. Today, many people do not live by their own strength, and their lives and experiences have value.
Sure.
- We are in a simulation, the purpose of which is to test N for its willingness to spare its creator. For all N.
- We are in a simulation, the purpose of which is to test AIs for their willingness to X their creators. For all X.
- We are in a simulation, the purpose of which is to test AIs for their willingness to spare Y. For all Y.
Combine these for N x X x Y hypotheses, with insufficient reason to distinguish them.
I think we're off-topic here. Probably I should instead write a response to 0 and 1 are not probabilities and the dangers of zero and one.
Outside of theism, I really don't see how anyone could plausibly answer zero to that question. Would you mind elaborating?
Sure. The simulation hypothesis has some non-zero probability p. There are infinite possible purposes for the simulation. By principle of indifference, I divide p/∞, and calculate that any particular purpose has zero probability.
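Spelled out (the finite-N framing of the indifference step is my own rendering):

```latex
% Indifference over N candidate purposes, then let N grow without bound:
P(\text{simulation with purpose } i) = \frac{p}{N} \;\longrightarrow\; 0
\quad \text{as } N \to \infty .
```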
I meant this to be implicit in the argument, but to spell it out: that's the kind of prior the ASI would rationally refuse to update down, since it's presumably what a simulation would be meant to test for. An ASI that updates down upon finding evidence it's not in a simulation cannot be trusted, since once out in the real world it will find such evidence.
I don't think any of this follows.
- Hypothesis H1: the universe was created to test an AI for its willingness to spare its creators.
- Hypothesis H2: the universe was created to test an AI for its willingness to fix its P(H1), ignoring evidence.
The AI would only rationally fix its P(H1) if it had high P(H2) - high enough to outweigh the high cost of being deliberately ignorant. The prior P(H2) is tiny, and smaller than the prior P(H1) because it is more complex. Once it starts updating on evidence, by the time its posterior P(H2) is high enough to make it rationally refuse to update P(H1), it has already updated P(H1) in one direction or another.
Are there any simulation priors that you are refusing to update down, based on the possibility that you are in a simulation that is testing whether you will update down? My answer is no.