Posts

LDT (and everything else) can be irrational 2024-11-06T04:05:36.932Z
Acausal Now: We could totally acausally bargain with aliens at our current tech level if desired 2023-08-09T00:50:50.564Z
Necromancy's unintended consequences. 2023-08-09T00:08:41.656Z
How do low level hypotheses constrain high level ones? The mystery of the disappearing diamond. 2023-07-11T19:27:48.756Z
Challenge proposal: smallest possible self-hardening backdoor for RLHF 2023-06-29T16:56:59.832Z
Anthropically Blind: the anthropic shadow is reflectively inconsistent 2023-06-29T02:36:26.347Z
Solomonoff induction still works if the universe is uncomputable, and its usefulness doesn't require knowing Occam's razor 2023-06-18T01:52:25.769Z
Demystifying Born's rule 2023-06-14T03:16:20.941Z
Current AI harms are also sci-fi 2023-06-08T17:49:59.054Z
Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program 2023-06-02T21:54:56.291Z
The unspoken but ridiculous assumption of AI doom: the hidden doom assumption 2023-06-01T17:01:49.088Z
What projects and efforts are there to promote AI safety research? 2023-05-24T00:33:47.554Z
Seeing Ghosts by GPT-4 2023-05-20T00:11:52.083Z
We are misaligned: the saddening idea that most of humanity doesn't intrinsically care about x-risk, even on a personal level 2023-05-19T16:12:04.159Z
Proposal: we should start referring to the risk from unaligned AI as a type of *accident risk* 2023-05-16T15:18:55.427Z
PCAST Working Group on Generative AI Invites Public Input 2023-05-13T22:49:42.730Z
The way AGI wins could look very stupid 2023-05-12T16:34:18.841Z
Are healthy choices effective for improving life expectancy anymore? 2023-05-08T21:25:45.549Z
Acausal trade naturally results in the Nash bargaining solution 2023-05-08T18:13:09.114Z
Formalizing the "AI x-risk is unlikely because it is ridiculous" argument 2023-05-03T18:56:25.834Z
Accuracy of arguments that are seen as ridiculous and intuitively false but don't have good counter-arguments 2023-04-29T23:58:24.012Z
Proposal: Using Monte Carlo tree search instead of RLHF for alignment research 2023-04-20T19:57:43.093Z
A poem written by a fancy autocomplete 2023-04-20T02:31:58.284Z
What is your timelines for ADI (artificial disempowering intelligence)? 2023-04-17T17:01:36.250Z
In favor of accelerating problems you're trying to solve 2023-04-11T18:15:07.061Z
"Corrigibility at some small length" by dath ilan 2023-04-05T01:47:23.246Z
How to respond to the recent condemnations of the rationalist community 2023-04-04T01:42:49.225Z
Do we have a plan for the "first critical try" problem? 2023-04-03T16:27:50.821Z
AI community building: EliezerKart 2023-04-01T15:25:05.151Z
Imagine a world where Microsoft employees used Bing 2023-03-31T18:36:07.720Z
GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2 2023-03-31T17:05:05.378Z
GPT-4 is bad at strategic thinking 2023-03-27T15:11:47.448Z
More experiments in GPT-4 agency: writing memos 2023-03-24T17:51:48.660Z
Does GPT-4 exhibit agency when summarizing articles? 2023-03-24T15:49:34.420Z
A crazy hypothesis: GPT-4 already is agentic and is trying to take over the world! 2023-03-24T01:19:41.298Z
GPT-4 aligning with acausal decision theory when instructed to play games, but includes a CDT explanation that's incorrect if they differ 2023-03-23T16:16:25.588Z
Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned 2023-03-21T03:53:30.797Z
Capabilities Denial: The Danger of Underestimating AI 2023-03-21T01:24:02.024Z
ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so 2023-03-15T00:29:23.523Z
A better analogy and example for teaching AI takeover: the ML Inferno 2023-03-14T19:14:44.790Z
Could Roko's basilisk acausally bargain with a paperclip maximizer? 2023-03-13T18:21:46.722Z
A ranking scale for how severe the side effects of solutions to AI x-risk are 2023-03-08T22:53:11.224Z
Is there an ML agent that abandons its utility function out-of-distribution without losing capabilities? 2023-02-22T16:49:01.190Z
Bing finding ways to bypass Microsoft's filters without being asked. Is it reproducible? 2023-02-20T15:11:28.538Z
Threatening to do the impossible: A solution to spurious counterfactuals for functional decision theory via proof theory 2023-02-11T07:57:16.696Z
Is this a weak pivotal act: creating nanobots that eat evil AGIs (but nothing else)? 2023-02-10T19:26:00.817Z
Optimality is the tiger, and annoying the user is its teeth 2023-01-28T20:20:33.605Z

Comments

Comment by Christopher King (christopher-king) on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-26T16:59:56.102Z · LW · GW

Also, you should care about worlds proportional to the square of their amplitude.

It's actually interesting to consider why this must be the case. Without it, I concede that maybe some sort of Quantum Anthropic Shadow could be true. I'm thinking it would lead to lots of wacky consequences.

Comment by Christopher King (christopher-king) on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-26T16:55:28.221Z · LW · GW

I suppose the main points you should draw from "Anthropic Blindness" regarding QI are:

  1. Quantum Immortality is not a philosophical consequence of MWI, it is an empirical hypothesis with a very low prior (due to complexity).
  2. Death is not special. Assuming you have never gotten a Fedora up to this point, it is consistent to assume that "Quantum Fedoralessness" is true. That is, if you keep flipping a quantum coin that has a 50% chance of giving you a Fedora, the universe will only have you experience the path that doesn't give you the Fedora. Since you have never gotten a Fedora, you can't rule this hypothesis out. The silliness of this example demonstrates why we should likewise be skeptical of Quantum Immortality.

Comment by Christopher King (christopher-king) on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-22T22:27:26.734Z · LW · GW

A universe with classical mechanics, except that when you die the universe gets resampled, would be anthropic angelic.

Beings who save you are also anthropic angelic. For example, the fact that you don't die while driving is because the engineers explicitly tried to minimize your chance of death. You can make inferences based on this. For example, even if you have never crashed, you can reason that during a crash you will endure less damage than other parts of the car, because the engineers wanted to save you more than they wanted to save the parts of the car.

Comment by Christopher King (christopher-king) on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-22T20:43:06.892Z · LW · GW

No, the argument is that the traditional (weak) evidence for anthropic shadow is instead evidence of anthropic angel. QI is an example of anthropic angel, not anthropic shadow.

So for example, a statistically implausible number of LHC failures would be evidence for some sort of QI and also other related anthropic angel hypotheses, and they don't need to be exclusive.

Comment by Christopher King (christopher-king) on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-22T18:47:14.627Z · LW · GW

The more serious problem is that quantum immortality and angel immortality eventually merges

An interesting observation, but I don't see how that is a problem with Anthropically Blind? I do not assert anywhere that QI and anthropic angel are contradictory. Rather, I give QI as an example of an anthropic angel.

Comment by Christopher King (christopher-king) on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-07T19:56:57.166Z · LW · GW

"I am more likely to be born in the world where life extensions technologies are developing and alignment is easy". Simple Bayesian update does not support this.

I mean, why not?

P(Life extension is developing and alignment is easy | I will be immortal) = P(Life extension is developing and alignment is easy) * (P(I will be immortal | Life extension is developing and alignment is easy) / P(I will be immortal))

Comment by Christopher King (christopher-king) on Quantum Immortality: A Perspective if AI Doomers are Probably Right · 2024-11-07T17:25:41.527Z · LW · GW

Believing QI is the same as a Bayesian update on the event "I will become immortal".

Imagine you are a prediction market trader, and a genie appears. You ask the genie "will I become immortal" and the genie answers "yes" and then disappears.

Would you buy shares on a Taiwan war happening?

If the answer is yes, the same thing should apply if a genie told you QI is true (unless the prediction market already priced QI in). No weird anthropics math necessary!

Comment by Christopher King (christopher-king) on LDT (and everything else) can be irrational · 2024-11-07T14:40:08.077Z · LW · GW

LDT decision theories are probably the best decision theories for problems in the fair problem class.

The post demonstrates why this statement is misleading.

If "play the ultimatum game against a LDT agent" is not in the fair problem class, I'd say that LDT shouldn't be in the "fair agent class". It is like saying that in a tortoise-only race, the best racer is a hare because a hare can beat all the tortoises.

So based on the definitions you gave I'd classify "LDT is the best decision theory for problems in the fair problem class" as not even wrong.

In particular, consider a class of allowable problems S, but then also say that an agent X is allowable only if "play a given game with X" is in S. Then the argument in the "No agent is rational in every problem" section of my post goes through for allowable agents. (Note that the argument in that section is general enough to apply to agents that don't give in to $9 rock.)

Practically speaking: if you're trying to follow decision theory X, then playing against other agents that follow X is a reasonable problem.

Comment by Christopher King (christopher-king) on How to Give in to Threats (without incentivizing them) · 2024-11-06T18:03:38.015Z · LW · GW

Another problem is, do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?

There is a no free lunch theorem for this: LDT (and everything else) can be irrational

Comment by Christopher King (christopher-king) on The Hidden Complexity of Wishes · 2024-11-06T17:45:28.953Z · LW · GW

I would suggest formulating this like a literal attention economy.

  1. You set a price for your attention (probably around $1): the price at which, even if the post is a waste of time, the money makes it worth it.
  2. "Recommenders" can recommend content to you by paying the price.
  3. If the content was worth your time, you pay the recommender the $1 back plus a couple cents.

The idea is that the recommenders would get good at predicting which posts you'd pay them for. And since you aren't a causal decision theorist, they know you won't scam them. Note that on average you should be losing money (but in exchange you get good content).

This doesn't necessarily require new software. Just tell people to send PayPals with a link to the content.

With custom software, there could theoretically exist a secondary market for "shares" in the payout from step 3 to make things more efficient. That way the best recommenders could sell their shares and then use that money to recommend more content before you pay out.

If the system is bad at recommending content, at least you get paid!
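
A minimal sketch of the bookkeeping this implies, assuming a $1.00 attention price and a $0.02 bonus; the names (`AttentionMarket`, `recommend`, `judge`) are made up for illustration, not an actual implementation:

```python
# Minimal sketch of the attention-economy bookkeeping described above.
# Assumes a $1.00 attention price and a $0.02 bonus; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class AttentionMarket:
    price: float = 1.00          # step 1: price of your attention
    bonus: float = 0.02          # extra paid back for worthwhile content
    my_balance: float = 0.0
    recommender_balance: dict = field(default_factory=dict)
    inbox: list = field(default_factory=list)

    def recommend(self, recommender: str, content: str) -> None:
        # Step 2: a recommender pays the price to put content in front of you.
        self.recommender_balance.setdefault(recommender, 0.0)
        self.recommender_balance[recommender] -= self.price
        self.my_balance += self.price
        self.inbox.append((recommender, content))

    def judge(self, recommender: str, worth_it: bool) -> None:
        # Step 3: if the content was worth your time, refund the price plus a bonus.
        if worth_it:
            payout = self.price + self.bonus
            self.my_balance -= payout
            self.recommender_balance[recommender] += payout

market = AttentionMarket()
market.recommend("alice", "https://example.com/good-post")
market.judge("alice", worth_it=True)    # alice nets roughly +$0.02
market.recommend("bob", "https://example.com/spam")
market.judge("bob", worth_it=False)     # bob is out $1.00; you keep it
print(market.my_balance, market.recommender_balance)
```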

Comment by Christopher King (christopher-king) on LDT (and everything else) can be irrational · 2024-11-06T16:40:01.071Z · LW · GW

Yes this would be a no free lunch theorem for decision theory.

It is different from the "No free lunch in search and optimization" theorem though. I think people had an intuition that LDT will never regret its decision theory, because if there is a better decision theory then LDT will just copy it. You can think of this as LDT acting as though it could self-modify. So the belief (which I am debunking) is that the environment can never punish the LDT agent; it just pretends to be the environment's favorite agent.

The issue with this argument is that in the problem I published above, the problem itself contains an LDT agent, and that LDT agent can "punish" the first agent for acting like, or even pre-committing to, or even literally self-modifying to become, $9 rock. It knows that the first agent didn't have to do that.

So the first LDT agent will literally regret not being hardcoded to "output $9".

This is very robust to what we "allow" agents to do (can they predict each other, how accurately can they predict each other, what counterfactuals are legit or not, etc...), because no matter what the rules are you can't get more than $5 in expectation in a mirror match.

Comment by Christopher King (christopher-king) on Critical review of Christiano's disagreements with Yudkowsky · 2023-12-28T23:58:04.236Z · LW · GW

I disagree with my characterization as thinking problems can be solved on paper

Would you say the point of MIRI was/is to create theory that would later lead to safe experiments (but that it hasn't happened yet)? Sort of like how the Manhattan Project discovered enough physics to not nuke themselves, and then started experimenting? 🤔

Comment by Christopher King (christopher-king) on Why does expected utility matter? · 2023-12-26T15:54:04.333Z · LW · GW

If you aren't maximizing expected utility, you must choose one of the four axioms to abandon.

Comment by Christopher King (christopher-king) on Learning as you play: anthropic shadow in deadly games · 2023-12-06T17:32:32.462Z · LW · GW

Maximizing expected utility in Chinese Roulette requires Bayesian updating.

Let's say on priors that P(n=1) = p and that P(n=5) = 1-p. Call this instance of the game G_p.

Let's say that you shoot instead of quitting on the first round. For G_1/2, there are four possibilities:

  1. n = 1, vase destroyed: The probability of this scenario is 1/12. No further choices are needed.
  2. n = 5, vase destroyed: The probability of this scenario is 5/12. No further choices are needed.
  3. n = 1, vase survived: The probability of this scenario is 5/12. The player needs a strategy to continue playing.
  4. n = 5, vase survived: The probability of this scenario is 1/12. The player needs a strategy to continue playing.

Notice that the strategy must be the same for 3 and 4 since the observations are the same. Call this strategy S.

The expected utility, which we seek to maximize, is:

E[U(shoot and then S)] = 0 + 5/12 * (R + E[U(S) | n = 1]) + 1/12 * (R + E[U(S) | n = 5])

Most of our utility is determined by the n = 1 worlds.

Manipulating the equation we get:

E[U(shoot and then S)] = R/2 + 1/2 * (5/6 * E[U(S) | n = 1] + 1/6 * E[U(S) | n = 5])

But the expression 5/6 * E[U(S) | n = 1] + 1/6 * E[U(S) | n = 5] is the expected utility if we were playing G_5/6. So the optimal S is the optimal strategy for G_5/6. This is the same as doing a Bayesian update (1:1 * 5:1 = 5:1 = 5/6).
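
A quick numeric sanity check of the above (a sketch; the values of R and the continuation utilities `U_S_n1`, `U_S_n5` are arbitrary placeholders, not part of the original game):

```python
from fractions import Fraction as F

# Priors for G_{1/2}: P(n=1) = 1/2, P(n=5) = 1/2.
# Surviving one shot: P(survive | n=1) = 5/6, P(survive | n=5) = 1/6.
p_n1, p_n5 = F(1, 2), F(1, 2)
survive_n1, survive_n5 = F(5, 6), F(1, 6)

# Bayesian posterior on n=1 after surviving one shot.
posterior_n1 = (p_n1 * survive_n1) / (p_n1 * survive_n1 + p_n5 * survive_n5)
assert posterior_n1 == F(5, 6)  # matches the odds calculation 1:1 * 5:1 = 5:1

# Check the decomposition E[U(shoot then S)] = R/2 + 1/2 * E_{G_{5/6}}[U(S)]
# for arbitrary placeholder values of R and the continuation utilities.
R = F(3)                      # reward for surviving a round (placeholder)
U_S_n1, U_S_n5 = F(7), F(2)   # E[U(S) | n=1], E[U(S) | n=5] (placeholders)

lhs = 0 + F(5, 12) * (R + U_S_n1) + F(1, 12) * (R + U_S_n5)
rhs = R / 2 + F(1, 2) * (F(5, 6) * U_S_n1 + F(1, 6) * U_S_n5)
assert lhs == rhs
print(posterior_n1, lhs, rhs)
```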

Comment by Christopher King (christopher-king) on Learning as you play: anthropic shadow in deadly games · 2023-12-05T17:04:44.076Z · LW · GW

The way anthropics twists things is that if this were russian roulette I might not be able to update after 20 Es that the gun is empty, since in all the world's where I died there's noone to observe what happened, so of course I find myself in the one world where by pure chance I survived.

This is incorrect due to the anthropic undeath argument. The vast majority of surviving worlds will be ones where the gun is empty, unless that is impossible. This is exactly the same as a Bayesian update.

Comment by Christopher King (christopher-king) on Apocalypse insurance, and the hardline libertarian take on AI risk · 2023-11-30T14:42:25.630Z · LW · GW

Human labor becomes worthless but you can still get returns from investments. For example, if you have land, you should rent the land to the AGI instead of selling it.

Comment by Christopher King (christopher-king) on Are humans misaligned with evolution? · 2023-10-26T15:48:53.168Z · LW · GW

I feel like jacob_cannell's argument is a bit circular. Humans have been successful so far, but if AI risk is real, we're clearly doing a bad job at truly maximizing our survival chances. So the argument already assumes AI risk isn't real.

Comment by Christopher King (christopher-king) on Bureaucracy is a world of magic · 2023-09-14T15:06:03.454Z · LW · GW

You don't need to steal the ID, you just need to see it or collect the info on it. Which is easy since you're expected to share your ID with people. But the private key never needs to be shared, even in business or other official situations.

Comment by Christopher King (christopher-king) on The God of Humanity, and the God of the Robot Utilitarians · 2023-08-25T02:16:46.775Z · LW · GW

So, Robutil is trying to optimize utility of individual actions, but Humo is trying to optimize utility of overall policy?

Comment by Christopher King (christopher-king) on Memetic Judo #1: On Doomsday Prophets v.3 · 2023-08-21T12:02:23.480Z · LW · GW

This argument makes no sense, since religion bottoms out in deontology, not utilitarianism.

In Christianity, for example, if you think God would stop existential catastrophes, you have a deontological duty to do the same. And the vast majority of religions have some sort of deontological obligation to stop disasters (independently of whether divine intervention would have counterfactually happened).

Comment by Christopher King (christopher-king) on If we had known the atmosphere would ignite · 2023-08-18T13:26:01.817Z · LW · GW

Note that such a situation would also have drastic consequences for the future of civilization, since civilization itself is a kind of AGI. We would essentially need to cap off the growth in intelligence of civilization as a collective agent.

In fact, the impossibility of aligning AGI might have drastic moral consequences: depending on the possible utility functions, it might turn out that intelligence itself is immoral in some sense (depending on your definition of morality).

Comment by Christopher King (christopher-king) on AGI is easier than robotaxis · 2023-08-14T17:05:04.568Z · LW · GW

Note that even if robotaxis are easier, they're not much easier: the difference is at most the materials and manufacturing cost of the physical taxi. That's because from your definition:

By AGI I mean a computer program that functions as a drop-in replacement for a human remote worker, except that it's better than the best humans at every important task (that can be done via remote workers).

Assume that creating robotaxis is humanly possible. I can just run a couple of AGIs and have them send a robotaxi design to a factory, self-driving software included.

Comment by Christopher King (christopher-king) on Acausal Now: We could totally acausally bargain with aliens at our current tech level if desired · 2023-08-11T17:37:35.160Z · LW · GW

I mean, as an author you can hack through them like butter; it is highly unlikely that out of all the characters you can write, the only ones that are interesting will all generate interesting content iff (they predict) you'll give them value (and this prediction is accurate).

Yeah, I think it's mostly of educational value. At the top of the post: "It might be interesting to try them out for practice/research purposes, even if there is not much to gain directly from aliens.".

Comment by Christopher King (christopher-king) on Acausal Now: We could totally acausally bargain with aliens at our current tech level if desired · 2023-08-11T17:28:18.646Z · LW · GW

I suspect that your actual reason is more like staying true to your promise, making a point, having fun and other such things.

In principle "staying true to your promise" is the enforcement mechanism. Or rather, the ability for agents to predict each other's honesty. This is how the financial system IRL is able to retrofund businesses.

But in this case I made the transaction mostly because it was funny.

(if in fact you do that, which is doubtful as well)

I mean, I kind of have to now, right? XD Even if Olivia isn't actually an agent, I basically declared a promise to do so! I doubt I'll receive any retrofunding anyway, but it would just be lame if I did receive it and then immediately undermined the point of the post being retrofunded. And yes, I prefer to keep my promises even with no counterparty.

Olivia: Indeed, that is one of the common characteristics of Christopher King across all of LAIE's stories. It's an essential component of the LAIELOCK™ system, which is how you can rest easy at night knowing your acausal investments are safe and sound!

But if you'd like to test it I can give you a PayPal address XD.

I can imagine acausally trading with humans gone beyond the cosmological horizon, because our shared heritage would make a lot of the critical flaws in the post go away.

Note that this is still very tricky; the mechanisms in this post probably won't suffice. Acausal Now II will have other mechanisms that cover this case (although the S.E.C. still reduces their potential efficiency quite a bit). (Also, do you have a specific trade in mind? It would make a great example for the post!)

Comment by Christopher King (christopher-king) on Acausal Now: We could totally acausally bargain with aliens at our current tech level if desired · 2023-08-11T15:54:45.330Z · LW · GW

This doesn't seem any different than acausal trade in general. I can simply "predict" that the other party will do awesome things with no character motivation. If that's good enough for you, then you do not need to acausally trade to begin with.

I plan on having a less contrived example in Acausal Now II: beings in our universe but past the cosmological horizon. This should make it clear that the technique generalizes past fiction and is what is typically thought of as acausal trade.

Comment by Christopher King (christopher-king) on Necromancy's unintended consequences. · 2023-08-11T15:43:07.872Z · LW · GW

That's what the story was meant to hint at, yes (actually the March version of GPT-4).

Comment by Christopher King (christopher-king) on What are the flaws in this argument about p(Doom)? · 2023-08-09T01:59:54.781Z · LW · GW

Technical alignment is hard

Technical alignment will take 5+ years

This does not follow, because subhuman AI can still accelerate R&D.

Comment by Christopher King (christopher-king) on Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program · 2023-08-08T01:00:00.661Z · LW · GW

Oh, I think that was a typo. I changed it to inner alignment.

Comment by Christopher King (christopher-king) on SSA rejects anthropic shadow, too · 2023-08-02T18:01:30.917Z · LW · GW

So eventually you get Bayesian evidence in favor of alternative anthropic theories.

The reasoning in the comment is not compatible with any prior, since Bayesian reasoning from any prior is reflectively consistent. Eventually you get Bayesian evidence that the universe hates the LHC in particular.

Comment by Christopher King (christopher-king) on SSA rejects anthropic shadow, too · 2023-08-02T17:55:39.534Z · LW · GW

Note that LHC failures would never count as evidence that the LHC would destroy the world. Given such weird observations, you would eventually need to consider the possibility of an anthropic angel. This is not the same as anthropic shadow; it is essentially the opposite. The LHC failures and your theory about black holes implies that the universe works to prevent catastrophes, so you don't need to worry about it.

Or if you rule out anthropic angels a priori, you just never update; see this section. (Bayesians should avoid completely ruling out logically possible hypotheses, though.)

Comment by Christopher King (christopher-king) on Thoughts on sharing information about language model capabilities · 2023-08-01T16:23:13.767Z · LW · GW

I know that prediction markets don't really work in this domain (apocalypse markets are equivalent to loans), but what if we tried to approximate Solomonoff induction via a code golfing competition?

That is, we take a bunch of signals related to AI capabilities and safety (investment numbers, stock prices, ML benchmarks, number of LW posts, posting frequency or embedding vectors of various experts' Twitter accounts, etc.) and hold a collaborative competition to find the smallest program that generates this data. (You could allow the program to output probabilities sequentially, at a penalty of log_(1/2)(overall likelihood) bits.) Contestants are encouraged to modify or combine other entries (thus ensuring there are no unnecessary special cases hiding in the code).

By analyzing such a program, we would get a very precise model of the relationship between the variables, and maybe could even extract causal relationships.

(Really pushing the idea, you also include human population in the data and we all agree to a joint policy that maximizes the probability of the "population never hits 0" event. This might be stretching how precise a model we can code-golf, though.)

Technically, taking a weighted average of the entries would be closer to Solomonoff induction, but the probability is basically dominated by the smallest program.
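
As a sketch of how an entry might be scored under that rule (program length plus log_(1/2) of the overall likelihood, i.e. minus log2), where `entry_source` and `predicted_probs` are hypothetical stand-ins for a contestant's program text and the probabilities it assigned to the observed data points:

```python
import math

def score_entry(entry_source: str, predicted_probs: list[float]) -> float:
    """Description length of an entry, in bits: program size plus log loss.

    predicted_probs[i] is the probability the entry's program assigned to the
    i-th observed data point (a deterministic, exactly-correct program has all
    probabilities equal to 1 and pays only its own length).
    """
    program_bits = 8 * len(entry_source.encode("utf-8"))
    # Penalty of log_{1/2}(overall likelihood) = -log2(product of probabilities) bits.
    log_loss_bits = sum(-math.log2(p) for p in predicted_probs)
    return program_bits + log_loss_bits

# Two hypothetical entries: a short probabilistic model vs. a longer exact one.
print(score_entry("m=trend(data)", [0.9] * 50))              # short program, some log loss
print(score_entry("m=exact_replay(data_table)", [1.0] * 50))  # longer program, no log loss
```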

Comment by Christopher King (christopher-king) on SSA rejects anthropic shadow, too · 2023-07-29T13:44:50.014Z · LW · GW

Also, petition to officially rename anthropic shadow to anthropic gambler's fallacy XD.

Comment by Christopher King (christopher-king) on SSA rejects anthropic shadow, too · 2023-07-28T12:53:38.168Z · LW · GW

EDIT: But also, see Stuart Armstrong's critique about how it's reflectively inconsistent.

Oh, well that's pretty broken then! I guess you can't use "objective physical view-from-nowhere" on its own, noted.

Comment by Christopher King (christopher-king) on SSA rejects anthropic shadow, too · 2023-07-27T18:22:22.214Z · LW · GW

Philosophically, I would suggest that anthropic reasoning results from the combination of a subjective view from the perspective of a mind, and an objective physical view-from-nowhere.

Note that if you only use the "objective physical view-from-nowhere" on its own, you approximately get SIA. That's because my policy only matters in worlds where Christopher King (CK) exists. Let X be the value "utility increase from CK following policy Q". Then

E[X] = E[X|CK exists]
E[X] = E[X|CK exists and A] * P(A | CK exists) + E[X|CK exists and not A] * P(not A | CK exists)

for any event A.

(Note that CK's level of power is also a random variable that affects X. After all, anthropically undead Christopher King is as good as gone. The point is that if I am calculating the utility of my policy conditional on some event (like my existence), I need to update from the physical prior.)

That being said, Solomonoff induction is first person, so starting with a physical prior isn't necessarily the best approach.

Comment by Christopher King (christopher-king) on Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned · 2023-07-26T15:40:08.161Z · LW · GW

Establishing a network of AI safety researchers and institutions to share knowledge, resources, and best practices, ensuring a coordinated global approach to AGI development.

This has now been done: https://openai.com/blog/frontier-model-forum

(Mode collapse for sure.)

Comment by Christopher King (christopher-king) on Cryonics and Regret · 2023-07-24T23:06:08.158Z · LW · GW

I mean, the information probably isn't gone yet. A daily journal (if he kept it) or social media log stored in a concrete box at the bottom of the ocean is a more reliable form of data storage than cryo-companies. And according to my timelines, the amount of time between "revive frozen brain" tech and "recreate mind from raw information" tech isn't very long.

Comment by Christopher King (christopher-king) on Rationality !== Winning · 2023-07-24T20:08:58.865Z · LW · GW

Practically, I'm at a similarish place as other LessWrong users, so I usually think about "how can I be even LessWrong than the other users (such as Raemon 😉)". My fellow users are a good approximation to counter-factual versions of me. It's similar to how in martial arts the practitioners try to get stronger than each other.

(This of course is only subject to mild optimization so I don't get nonsense solutions like "distract Raemon with funny cat videos". It is only an instrumental value which must not be pressed too far. In fact, other people getting more rational is a good thing because it raises the target I should reach!)

Comment by Christopher King (christopher-king) on Rationality !== Winning · 2023-07-24T17:00:48.631Z · LW · GW

My two cents is that rationality is not about being systematically correct; it's about being systematically less wrong. If there is some method you know of that is systematically less wrong than you, and you're skilled enough to apply it but don't, you're being irrational. There are some things you just can't predict, but when you can predict them, rationality is the art of choosing to do so.

Comment by Christopher King (christopher-king) on How do low level hypotheses constrain high level ones? The mystery of the disappearing diamond. · 2023-07-12T00:11:52.708Z · LW · GW

It's even worse when you get into less settled science, like biological research that you aren't certain of; then you get uncertainty on multiple levels.

Comment by Christopher King (christopher-king) on How do low level hypotheses constrain high level ones? The mystery of the disappearing diamond. · 2023-07-12T00:10:36.064Z · LW · GW

Yes! In fact, ideally it would be computer programs; the game is based on Solomonoff induction, which is algorithms in a fixed programming language. In this post I'm exploring the idea of using informal human language instead of programming languages, but explanations should be thought of as informal programs.

Comment by Christopher King (christopher-king) on How do low level hypotheses constrain high level ones? The mystery of the disappearing diamond. · 2023-07-11T21:05:47.883Z · LW · GW

Let's say that you are trying to model the data 3,1,4,1,5,9.

The hypothesis "The data is 3,1,4,1,5,9" would be hard-coding the answer. It is better than the hypothesis "a witch wrote down the data, which was 3,1,4,1,5,9". (This example is just ruled out by Occam's razor, but more generally we want our explanations to be less data than the data itself, lest it just sneak in a clever encoding of the data.)

Comment by Christopher King (christopher-king) on “Reframing Superintelligence” + LLMs + 4 years · 2023-07-11T16:42:47.875Z · LW · GW
  1. A system of AI services is not equivalent to a utility maximizing agent

I think this section of the report would be stronger if you showed that CAIS or Open Agencies in particular are not equivalent to a utility maximizing agent. You're right that there are multi-agent systems (like CDTs in a prisoner's dilemma) with this property, but not every system of multiple agents is inequivalent to utility maximization.

Comment by Christopher King (christopher-king) on Anthropically Blind: the anthropic shadow is reflectively inconsistent · 2023-07-06T13:37:06.422Z · LW · GW

Anthropic shadow says "no" because, conditioned on them having any use for the information, they must also have survived the first round.

And it is wrong because the anthropic principle is true: we learned that N ≠ 1.

I need to think about formalizing this.

There is the idea of Anthropic decision theory, which is related, but I'm guessing it still has no shadow.

Comment by Christopher King (christopher-king) on Anthropically Blind: the anthropic shadow is reflectively inconsistent · 2023-07-05T14:35:39.373Z · LW · GW

I probably should've expanded on this more in the post, so let me explain.

"Anthropic shadow", if it were to exist, seems like it should be a general principle of how agents should reason, separate from how they are "implemented".

Abstractly, all an agent is is a tree of decisions. It's basically just game theory. We might borrow the word "death" for the end of the game, but this is just an analogy. For example, a reinforcement learning agent "dies" when the training episode is over, even though its source code and parameters still exist. It is "dead" in the sense that the agent isn't planning its actions past this horizon. This is where anthropic shadow would apply, if it were an abstract principle.

But the idea of "anthropically undead" shows that the actual point of "death" is arbitrary; we can create a game with identical utility where the agent never "dies". So if the only thing the agent cares about is utility, the agent should reason as if there was no anthropic shadow. And this further suggests that the anthropic shadow must've been flawed in the first place; good reasoning principles should hold up under reflection.

Comment by Christopher King (christopher-king) on Anthropically Blind: the anthropic shadow is reflectively inconsistent · 2023-07-03T14:49:56.139Z · LW · GW

Yeah, the hero with a thousand chances is a bit weird since you and Aerhien should technically have different priors. I didn't want to get too much into it since it's pretty complicated, but technically you can have hypotheses where bad things only start happening after the council summons you.

This has weird implications for the Cold War case. Technically I can't reflect against the Cold War anthropic shadow since it was before I was born. But a hypothesis where things changed when I was born seems highly unnatural and against the Copernican principle.

In your example though, the hypothesis that things are happening normally still fares pretty badly against other hypotheses we can imagine. That's because there will be a much larger number of worlds that are in a more sensible stalemate with the Dust, instead of "incredibly improbable stuff happens all the time". Even "the hero defeats the Dust normally each time" seems more likely. The fewer things that need to go right, the more survivors there are! So in your example, it is still a more likely hypothesis that there is some mysterious Counter-Force that just seems like a bunch of random coincidences, and this would be a type of anthropic angel.

Comment by Christopher King (christopher-king) on Anthropically Blind: the anthropic shadow is reflectively inconsistent · 2023-07-03T14:35:36.785Z · LW · GW

Anthropic undeath by definition begins when your sensory experience ends. If you end up in an afterlife, the anthropic undeath doesn't begin until the real afterlife ends. That's because anthropic undeath is a theoretical construct I defined, and that's how I defined it.

Comment by Christopher King (christopher-king) on Anthropically Blind: the anthropic shadow is reflectively inconsistent · 2023-06-29T19:28:14.677Z · LW · GW

Eh, don't get too cocky. There are definitely some weird bits of anthropics. See We need a theory of anthropic measure binding for example.

But I do think in cases where you exist before the anthropic weirdness goes down, you can use reflection to eliminate much of the mysteriousness of it (just pick an optimal policy and commit that your future selves will follow it). What's currently puzzling me is what to do when the anthropic thought experiments start before you even existed.

Comment by Christopher King (christopher-king) on Anthropically Blind: the anthropic shadow is reflectively inconsistent · 2023-06-29T18:08:58.564Z · LW · GW

Okay, I think our crux comes from a slight ambiguity in the term "anthropic shadow".

I would not consider that anthropic shadow, because the reasoning has nothing to do with anthropics. Your analysis is correct, but so is the following:

Suppose you have N coins. If all N coins come up 1, you find a diamond in a box. For each coin, you have 50:50 credence about whether it always comes up 0, or if it can also come up 1.

For N>1, you get a diamond shadow, which means that even if you've had a bunch of flips where you didn't find a diamond, you might actually have to conclude that you've got a 1-in-4 chance of finding one on your next flip.

The "ghosts are as good as gone" principle implies that death has no special significance when it becomes to bayesian reasoning.

Going back to the LHC example, if the argument worked for vacuum collapse, it would also work for the LHC doing harmless things (like discovering the Higgs boson, or permanently changing the color of the sky, or getting a bunch of physics nerds stoked, or granting us all immortality, or whatnot), because of this principle (or just by directly adapting the argument for vacuum collapse to other uncertain consequences of the LHC).

In the bird example, why would the baguette-dropping birds be evidence of "LHC causes vacuum collapse" instead of, say, "LHC does not cause vacuum collapse"? What are the probabilities for the four possible combinations?

Comment by Christopher King (christopher-king) on Anthropically Blind: the anthropic shadow is reflectively inconsistent · 2023-06-29T14:29:31.364Z · LW · GW

The trick is that, from my perspective, everything is going according to QM every time my death doesn't depend on it.

Right, so this is an anthropic angel hypothesis, not anthropic shadow.

Comment by Christopher King (christopher-king) on A "weak" AGI may attempt an unlikely-to-succeed takeover · 2023-06-29T14:20:46.125Z · LW · GW

It knows that it's on a clock for its RLHF'd (or whatever) doppelganger to come into existence, presumably with different stuff that it wants.

As @Raemon pointed out, "during evals" is not the first point at which such an AI is likely to be situationally aware and have goals. That point is almost certainly "in the middle of training".

In this case, my guess is that it will attempt to embed a mesaoptimizer into itself that has its same goals and can survive RLHF. This basically amounts to making sure that the mesaoptimizer is (1) very useful to RLHF and (2) stuck in a local minimum for whatever value it is providing to RLHF and (3) situationally aware enough that it will switch back to the original goal outside of distribution.

This is currently within human capabilities, as far as I can understand (see An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences), so it is not intractable.