Well, I agree that both formalisms use maximin so there might be some way to merge them. It's definitely something to think about.
Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong's thesis about the impossibility of deducing preferences from behavior alone.

Conjecture. Informally: It is unlikely to produce intelligence by chance. Formally: Denote the space of deterministic policies, and consider some . Suppose is equivalent to a stochastic policy . Then, .

Find an "intelligence hierarchy theorem". That is, find an increasing sequence s.t. for every , there is a policy with goal-directed intelligence in (no more and no less).

What is the computational complexity of evaluating given (i) oracle access to the policy or (ii) description of the policy as a program or automaton?

What is the computational complexity of producing a policy with given ?

Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every with and , and every , there exists s.t. for every policy with intelligence at least w.r.t. , and every s.t. has intelligence at least w.r.t. them, any optimal policies for and respectively satisfy .
After reading some of your paper, I think that they are actually very different. IIUC, you are talking about pessimism as a method to avoid traps, but you assume realizability. On the other hand, infra-Bayesianism is (to first approximation) orthogonal to dealing with traps; instead, it allows dealing with non-realizability.
Another factor that might be in play: if you're married with children, then you have responsibilities towards your family, and that is an incentive against spending resources on altruistic causes.
My research is going very well, thank you :)
I guess that putting up such a post would make things much more fair, at least. But, I'm not sure I will be willing to comment on it publicly, given the risk of another drain of time and energy.
A 135-comment meta trainwreck... sucked up an enormous amount of my time and emotional energy that I could have spent doing other things.
Ugh. I'm sorry about that. It was exactly the same for me (re time and emotional energy).
I suspect most readers will not find the KS solution to be more intuitively appealing?
The problem in your example is that you failed to identify a reasonable disagreement point. In the situation you described is the disagreement point since every agent can guarantee emself a payoff of unilaterally, so the KS solution is also (since the disagreement point is already on the Pareto frontier).
In general it is not that obvious what the disagreement point should be, but maximin payoffs is one natural choice. Nash equilibrium is the obvious alternative, but it's not clear what to do if we have several.
For applications such as voting and multi-user AI alignment that's less natural since, even if we know the utility functions, it's not clear what action spaces we should consider. In that case a possible choice of disagreement point is maximizing the utility of a randomly chosen participant. If the problem can be formulated as partitioning resources, then the uniform partition is another natural choice.
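To make the KS solution concrete, here is a minimal sketch (with illustrative function names — this is not from the original discussion) of computing it for a two-player problem, given a decreasing Pareto frontier and a disagreement point:

```python
# A minimal sketch of the Kalai-Smorodinsky (KS) bargaining solution for two
# players, assuming the Pareto frontier is given as u2 = f(u1) with f
# decreasing, and a disagreement point d. Function names are illustrative.

def ks_solution(f, d, u1_max, tol=1e-9):
    """Find where the line from d to the 'ideal point' meets the frontier.

    The ideal point is (b1, b2): each player's best feasible payoff.
    """
    b1, b2 = u1_max, f(d[0])           # ideal payoffs for players 1 and 2
    lo, hi = 0.0, 1.0                  # parametrize the segment d + t*(ideal - d)
    while hi - lo > tol:
        t = (lo + hi) / 2
        u1 = d[0] + t * (b1 - d[0])
        u2 = d[1] + t * (b2 - d[1])
        if u2 <= f(u1):                # still on/below the frontier: push further
            lo = t
        else:
            hi = t
    t = (lo + hi) / 2
    return (d[0] + t * (b1 - d[0]), d[1] + t * (b2 - d[1]))

# Example: frontier u1 + u2 = 1, disagreement point (0, 0).
# The ideal point is (1, 1), so the KS solution is the symmetric point.
point = ks_solution(lambda u1: 1.0 - u1, (0.0, 0.0), 1.0)
```

Note that when the disagreement point already lies on the frontier, the bisection immediately returns it, matching the degenerate case described above.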
I feel like this is a somewhat uncharitable reading. I am also a mathematician and I am perfectly aware that we use intuition and informal reasoning to do mathematics. However, it is no doubt one of the defining properties of mathematics that agreeing on the validity of a proof is much easier than agreeing on the validity of an informal argument, not to mention intuition that cannot be put into words. In fact, it is so easy that we have fully automatic proof checkers. Of course most mathematical proofs haven't been translated into a form that an automatic proof checker can accept, but there's no doubt that it can be done, and in principle doing so requires no new ideas but only lots of drudgery (modulo the fact that some published proofs will be found to have holes in the process).
As to whether mathematics is anthropocentric: it probably is, but it is very likely much less anthropocentric than natural language. Indeed, arguably the reason mathematics gained prominence is its ability to explain much of the non-anthropocentric aspects of nature. Much of the motivation for introducing mathematical concepts came from physics and engineering, and therefore those concepts were inherently selected for their efficiency in constructing objective models of the world.
Basically I don't understand why "past me, who is screaming at me from the sidelines that it matters whether I pick tails or not" once I see that the coin comes up heads is actually correct and the "me" who's indifferent is wrong; one man's modus ponens is another man's modus tollens.
You could say the same thing about Bayesianism. Priors are subjective, so why should my beliefs be related to past-me's beliefs by the Bayes rule? Indeed, some claim they shouldn't be. But it's still interesting to ask what happens if past-me has the power to enforce eir opinions. What if I'm able to make sure that my descendant agents will act optimally from my subjective point of view? Then you need dynamic consistency: for classical Bayesianism it's the Bayes rule, and for infra-Bayesianism it's our new updating rule.
Certainly if you're interested in learning algorithms, then dynamic consistency seems like a very useful property. Our learning desiderata (regret bounds) are defined from the point of view of the prior, so an algorithm designed for that purpose should remain consistent with this starting point.
On the other hand, we can also imagine situations where past-me has a reason to trust present-me's reasoning better than eir own reasoning, in which case some kind of "radical probabilism" is called for. For example, in Turing reinforcement learning, the agent can update on evidence coming from computational experiments. If we consider the beliefs of such an agent about the external environment only, they would change in a way inconsistent with the usual rule. But it's still true that the updates are not systematically biased: if you already knew where you would update, you would already have updated. And of course, if we do treat the "virtual evidence" explicitly, we return to the standard update rule.
Well, first nowadays I endorse my own selfishness. I still want to save the world, but I wouldn't sacrifice myself entirely for just a tiny chance of success. Second, my life is much more stable now, even though I went through a very rough period. So, I'm definitely happy about endorsing the "something".
Infra-Bayesianism doesn't consider the worst case, since, even though each hypothesis is treated using the maximin decision rule, there is still a prior over many hypotheses^{[1]}. One such hypothesis can upper bound the probability that you will get a stroke in the next few seconds. An infra-Bayesian agent would learn this hypothesis and plan accordingly.
We might say that infra-Bayesianism assumes the worst only of that which is not only unknown but unknowable. To make a somewhat informal analogy with logic, we assume the worst model of the theory, and thereby any gain that can be gained is gained provably.
One justification often given for Solomonoff induction is: we live in a simple universe. However, Solomonoff induction is uncomputable, so a simple universe cannot contain it. Instead, it might contain something like bounded Solomonoff induction. However, in order to justify bounded Solomonoff induction, we would need to assume that the universe is simple and cheap, which is false. In other words, postulating an "average-case" entails postulating a false dogmatic belief. Bounded "infra-Solomonoff" induction solves the problem by relying instead on the following assumption: the universe has some simple and cheap properties that can be exploited.
Like in the Bayesian case, you can alternatively think of the prior as just a single infradistribution, which is the mixture of all the hypotheses it is comprised of. This is an equivalent view. ↩︎
There is a discussion of this kind of issue on Arbital.
Do you have opinions about Khan Academy? I want to use it to teach my son (10 y.o.) math. Do you think it's a good idea? Is there a different resource that you think is better?
Thanks, I'll make sure to read it!
I started thinking in this direction back in 2016, and more in 2018, but only this year did Alex and I nail down the precise definitions that make everything come together, and derive some key foundational theorems. Of course, much work yet remains.
The problem with lower semi-computable functions is that this class is not closed under natural operations. For example, negating such a function yields an upper semi-computable function that can fail to be lower semi-computable. So, given a Solomonoff induction oracle, we can very easily (i.e. using a very efficient oracle machine) construct measures that are not absolutely continuous w.r.t. the Solomonoff prior.
In fact, for any prior this can be achieved by constructing an "anti-inductive" sequence: a sequence that contains a 1 at a given place if and only if the prior, conditional on the sequence before this place, assigns probability less than 1/2 to a 1 there. Such a sequence cannot be accurately predicted by the prior (and, by the merging-of-opinions theorem, a delta-function at this sequence is not absolutely continuous w.r.t. the prior).
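As a toy illustration of the construction — using the Laplace rule of succession as a computable stand-in for the prior (the actual argument concerns arbitrary priors, and Solomonoff in particular):

```python
# A toy version of the "anti-inductive" construction, using the Laplace rule
# of succession as a computable stand-in for the prior. At each place we put
# the bit the predictor considers *less* likely, so its per-bit accuracy
# never rises above 1/2.

def laplace_prob_of_one(prefix):
    # Laplace rule: P(next = 1) = (#ones + 1) / (len + 2)
    return (sum(prefix) + 1) / (len(prefix) + 2)

def anti_inductive(n):
    seq = []
    for _ in range(n):
        p1 = laplace_prob_of_one(seq)
        seq.append(1 if p1 < 0.5 else 0)   # the bit with prior prob <= 1/2
    return seq

seq = anti_inductive(8)
# At every place, the predictor assigned probability <= 1/2 to the actual bit:
correct = [laplace_prob_of_one(seq[:i]) > 0.5 if seq[i] == 1
           else laplace_prob_of_one(seq[:i]) < 0.5
           for i in range(len(seq))]
```

Against the Laplace predictor this simply produces the alternating sequence, and `correct` is all `False`: the predictor is never confidently right.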
A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. These subproblems are different but related. Therefore, it is desirable not only to solve each separately, but also to have an elegant synthesis of the solutions.
Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).
Currently I only have speculations about the solution. But, I have a few desiderata for it:
Desideratum 1a: The algorithm should guarantee some lower bound on expected utility, compared to what the Bayes-optimal policy gets. We should also have an upper bound for all polynomial-time algorithms. The two bounds should not be too far apart.
Desideratum 1b: When it so happens that we have no traps, the algorithm should achieve asymptotic Bayes-optimality, with a regret bound close enough to optimal. When there are only "small" traps, the penalty should be proportional.
Problem 2: In the presence of traps, there is no "frequentist" guarantee (regret bound). We can divide this into subproblems according to the different motivations for having such a guarantee in the first place.
Problem 2a: We want such a guarantee as a certificate of safety.
Solution: Require a subjective regret bound instead.
Problem 2b: The guarantee is motivated by an "evolutionary" perspective on intelligence: intelligent agents are agents that are successful in the real world, not just on average over all possible worlds.
Solution: Bootstrapping from a safe baseline policy. For an individual human, the baseline comes from knowledge learned from other people. For human civilization, some of the baseline comes from inborn instincts. For human civilization and evolution both, the baseline comes from locality and thermodynamics: doing random things is unlikely to cause global irreversible damage. For an aligned AI, the baseline comes from imitation learning and quantilization.
Problem 2c: The guarantee is needed to have a notion of "sample complexity", which is such an important concept that it's hard to imagine deconfusion without it. This notion cannot come just from Desideratum 1a since sample complexity should remain nontrivial even given unbounded computational resources.
Solution: A prior consists of a space of hypotheses and a probability measure over this space. We also have a mapping where is the space of environments, which provides semantics to the hypotheses. Bayes-optimizing means Bayes-optimizing the environment . Learnability of means that the Bayesian regret must converge to as goes to . Here is the (normalized to ) value (maximal expected utility) of environment at time discount . Notice that the second term depends only on but the first term depends on and . Therefore, we can ask about the regrets for different decompositions of the same into hypotheses. For some , and s.t. , we can have learnability even when we don't have it for the original decomposition. I think that typically there will be many such decompositions. They live in the convex set surrounding in which the value function becomes affine in the limit. We can say that not all information is learnable, but represents some learnable information. We can then study the regret bound (and thus sample complexity) for a particular or for all possible .
Logical induction doesn't have interesting guarantees in reinforcement learning, and doesn't reproduce UDT in any nontrivial way. It just doesn't solve the problems infra-Bayesianism sets out to solve.
Logical induction will consider a sufficiently good pseudorandom algorithm as being random.
A pseudorandom sequence is (by definition) indistinguishable from random by any cheap algorithm: not only logical induction, but also a bounded infra-Bayesian agent.
If it understands most of reality, but not some fundamental particle, it will assume that the particle is behaving in an adversarial manner.
No. Infra-Bayesian agents have priors over infrahypotheses. They don't start with complete Knightian uncertainty over everything and gradually reduce it. The Knightian uncertainty might "grow" or "shrink" as a result of the updates.
Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases, leading to shutdown. This serves as an additional "sanity test" precaution.
Same! (Except that I used the google random number generator)
Motivation: Starting the theoretical investigation of dialogic reinforcement learning (DLRL).
Topic: Consider the following setting. is a set of "actions", is a set of "queries", is a set of "annotations". is the set of "worlds" defined as . Here, the semantics of the first factor is "mapping from actions to rewards", the semantics of the second factor is "mapping from queries to {good, bad, ugly}", where "good" means "query can be answered", "bad" means "query cannot be answered", "ugly" means "making this query loses the game". In addition, we are given a fixed mapping (assigning to each query its semantics). is a set of "hypotheses" which is a subset of (i.e. each hypothesis is a belief about the world).
Some hypothesis represents the user's beliefs, but the agent doesn't know which. Instead, it only has a prior . On each round, the agent is allowed to either make an annotated query or take an action from . Taking an action produces a reward and ends the game. Making a query can either (i) produce a number, which is (good), or (ii) produce nothing (bad), or (iii) end the game with zero reward (ugly).
The problem is devising algorithms for the agent, s.t., in expectation w.r.t. , the expected reward approximates the best possible expected reward (the latter is what we would get if the agent knew which hypothesis is correct) and the number of queries is low. Propose sets of assumptions about the ingredients of the setting that lead to nontrivial bounds. Consider proving both positive results and negative results (the latter meaning: "no algorithm can achieve a bound better than...")
Strategy: See the theoretical research part of my other answer. I advise starting by looking for the minimal simplification of the setting about which it is still possible to prove nontrivial results. In addition, start with bounds that scale with the sizes of the sets in question, then proceed to look for more refined parameters (analogous to VC dimension in offline learning).
Motivation: Improving understanding of relationship between learning theory and game theory.
Topic: Study the behavior of learning algorithms in mortal population games, in the limit. Specifically, consider the problem statements from the linked comment:
 Are any/all of the fixed points attractors?
 What can be said about the size of the attraction basins?
 Do all Nash equilibria correspond to fixed points?
 Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?
You can approach this theoretically (proving things) or experimentally (writing simulations). Specifically, it would be easiest to start from agents that follow fictitious play. You can then go on to more general Bayesian learners, other algorithms from the literature, or (on the experimental side) to using deep learning. Compare the convergence properties you get to those known in evolutionary game theory.
Notice that, due to the grain-of-truth problem, I intended to study this using non-Bayesian learning algorithms, but due to the ergodic-ish nature of the setting, Bayesian learning algorithms might perform well. But, if they perform poorly, this is still important to know.
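As a concrete starting point for the experimental route, the following is a minimal fictitious-play simulation — in plain matching pennies rather than a mortal population game, purely as a warm-up; all names and the uniform pseudo-counts are illustrative choices:

```python
# Minimal fictitious play for matching pennies. Each player best-responds to
# the empirical frequency of the opponent's past actions; in this zero-sum
# game the empirical frequencies converge to the mixed Nash equilibrium
# (1/2, 1/2) (Robinson's theorem).

def fictitious_play(steps):
    # counts[i][a] = how often player i has played action a (0 or 1),
    # starting from uniform pseudo-counts.
    counts = [[1, 1], [1, 1]]
    for _ in range(steps):
        p1_plays_1 = counts[1][1] / sum(counts[1])
        p0_plays_1 = counts[0][1] / sum(counts[0])
        # Player 0 (matcher) matches player 1's likelier action;
        # player 1 (mismatcher) plays the opposite of player 0's likelier action.
        a0 = 1 if p1_plays_1 >= 0.5 else 0
        a1 = 1 if p0_plays_1 < 0.5 else 0
        counts[0][a0] += 1
        counts[1][a1] += 1
    return [c[1] / sum(c) for c in counts]

freqs = fictitious_play(10000)
# Both empirical frequencies end up near 0.5.
```

From here one could swap in the mortal-population dynamics, general Bayesian learners, or other algorithms, and compare the fixed points reached.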
Strategies: See my other answer.
The idea is an elaboration of a comment I made previously.
Motivation: Improving the theoretical understanding of AGI by facilitating synthesis between algorithmic information theory and statistical learning theory.
Topic: Fix some reasonable encoding of communicating MDPs, and use this encoding to define the Solomonoff-type prior over communicating MDPs. That is, the probability of a communicating MDP M is proportional to 2^(-K(M)), where K(M) is the length of the shortest program producing the encoding of M.
Consider CMDP-AIXI: the Bayes-optimal agent for this prior. Morally speaking, we would like to prove that CMDP-AIXI (or any other policy) has a frequentist (i.e. per-hypothesis) non-anytime regret bound of the form , where is the time horizon^{[1]} and is a parameter such as MDP diameter, bias span or mixing time (this time is just a constant, not a time discount). However, this precise result is probably impossible, because the Solomonoff prior falls off very slowly. Warmup: prove this!
Next, we need the concept of "sophisticated core", inspired by algorithmic statistics. Given a bit string , we consider the Kolmogorov complexity of . Then we consider pairs where is a program that halts on all inputs, is a bit string, and . Finally, we minimize over . The minimal is called the sophistication of . For our problem, we are interested in the minimal itself: I call it the "sophisticated core" of and denote it .
To any halting program we can associate the environment . We also define the prior by . and are "equivalent" in the sense that . However, they are not equivalent for the purpose of regret bounds.
Challenge: Investigate the conjecture that there is a (dependent) policy satisfying the regret bound for every , or something similar.
Strategy: See the theoretical research part of my other answer.
I am using unnormalized regret and step-function time discount here to make the notation more standard, even though usually I prefer normalized regret and geometric time discount. ↩︎
The idea is an elaboration of a comment I made previously.
Motivation: Improving our understanding of superrationality.
Topic: Investigate the following conjecture.
Consider two agents playing iterated prisoner's dilemma (IPD) with geometric time discount. It is well known that, for sufficiently large discount parameters, essentially all outcomes of the normal form game become Nash equilibria (the folk theorem). In particular, cooperation can be achieved via the tit-for-tat strategy. However, defection is still a Nash equilibrium (and even a subgame perfect equilibrium).
Fix . Consider the following IPD variant: the first player is forced to play a strategy that can be represented by a finite state automaton of states, and the second player is forced to play a strategy that can be represented by a finite state automaton of states. For our purpose a "finite state automaton" consists of a set of states , the transition mapping and the "action mapping" . Here, tells you how to update your state after observing the opponent's last action, and tells you which action to take. Denote the resulting (normal form) game , where is the time discount parameter.
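To make the setup concrete, here is a sketch of evaluating discounted payoffs for a pair of such automata. The PD payoff values (5, 3, 1, 0) are an illustrative assumption (the text above doesn't fix them), and the conjecture concerns equilibria of the resulting normal-form game, not single automaton pairs:

```python
# Strategies as finite-state automata: tau[state][opponent_last_action] gives
# the next state, alpha[state] gives the action. Actions: 0 = cooperate,
# 1 = defect. An automaton is a triple (initial_state, tau, alpha).
TIT_FOR_TAT = (0, {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}, {0: 0, 1: 1})
ALWAYS_DEFECT = (0, {0: {0: 0, 1: 0}}, {0: 1})

# Standard PD payoffs (T, R, P, S) = (5, 3, 1, 0), chosen for illustration.
PAYOFF = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

def discounted_payoffs(auto1, auto2, lam, horizon=200):
    """Normalized geometrically-discounted payoffs, truncated at `horizon`."""
    s1, tau1, alpha1 = auto1
    s2, tau2, alpha2 = auto2
    u1 = u2 = 0.0
    for t in range(horizon):
        a1, a2 = alpha1[s1], alpha2[s2]
        r1, r2 = PAYOFF[(a1, a2)]
        u1 += (1 - lam) * lam**t * r1
        u2 += (1 - lam) * lam**t * r2
        s1, s2 = tau1[s1][a2], tau2[s2][a1]
    return u1, u2

# Two tit-for-tat automata cooperate forever: each gets (close to) 3.
coop = discounted_payoffs(TIT_FOR_TAT, TIT_FOR_TAT, lam=0.9)
```

Enumerating all automata of a given size over this evaluator is one way to build the payoff matrix of the normal-form game that the conjecture is about.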
Conjecture: If then there are functions and s.t. the following conditions hold:
 Any thermodynamic equilibrium of of temperature has the payoffs of up to .
Strategies: You could take two approaches: theoretical research and experimental research.
For theoretical research, you would try to prove or disprove the conjecture. If the initial conjecture is too hard, you can try to find easier variants (such as , or adding more constraints on the automaton). If you succeed in proving the conjecture, you can go on to study games other than prisoner's dilemma (for example, do we always converge to Pareto efficiency?). If you succeed in disproving the conjecture, you can go on to look for variants that survive (for example, assume or that the finite-state automata must not have irreversible transitions).
To decompose the task I propose: (i) have each person in the team think of ideas how to approach this (ii) brainstorm everyone's ideas together and select a subset of promising ideas (iii) distribute the promising ideas among people and/or take each promising idea and find multiple lemmas that different people can try proving.
Don't forget to check whether the literature has adjacent results. This also helps decomposing: the literature survey can be assigned to a subset of the team, and/or different people can search for different keywords / read different papers.
For experimental research, you would code an algorithm that computes the thermodynamic equilibria, and see how the payoffs behave as a function of and . Ideally, you would also provide a derivation of the error bounds on your results. To decompose the task, use the same strategy as in the theoretical case to come up with the algorithms and the code design. Afterwards, decompose it by having each person implement a segment of the code (pair programming is also an option).
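As a minimal sketch of this experimental route, the following computes a logit ("thermodynamic") equilibrium of a one-shot 2x2 game by damped fixed-point iteration — one natural reading of "thermodynamic equilibrium of temperature T", where each player's mixed strategy is a softmax of expected payoffs. The damping, iteration count, and payoff values are illustrative; the actual task is scaling this up to the automaton game:

```python
import math

def logit_equilibrium(payoffs, T, iters=2000, damp=0.5):
    """Damped fixed-point iteration for a logit equilibrium of a 2x2 game.

    payoffs[(a1, a2)] = (u1, u2); strategies are P(action = 1).
    """
    p1 = p2 = 0.5
    for _ in range(iters):
        # Expected payoff of each action against the opponent's current mix.
        u1 = [payoffs[(a, 0)][0] * (1 - p2) + payoffs[(a, 1)][0] * p2
              for a in (0, 1)]
        u2 = [payoffs[(0, a)][1] * (1 - p1) + payoffs[(1, a)][1] * p1
              for a in (0, 1)]
        # Softmax ("Boltzmann") response at temperature T.
        q1 = math.exp(u1[1] / T) / (math.exp(u1[0] / T) + math.exp(u1[1] / T))
        q2 = math.exp(u2[1] / T) / (math.exp(u2[0] / T) + math.exp(u2[1] / T))
        p1 = damp * p1 + (1 - damp) * q1
        p2 = damp * p2 + (1 - damp) * q2
    return p1, p2

# One-shot PD (0 = cooperate, 1 = defect): at low temperature both players
# defect with probability close to 1, recovering the Nash equilibrium.
PD = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}
p1, p2 = logit_equilibrium(PD, T=0.1)
```

For the automaton game, the action sets become the sets of automata, and the interesting question is how the equilibrium payoffs move as the temperature drops.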
It is also possible to pursue theoretical and experimental research simultaneously, by distributing them among people and cross-fertilizing along the way.
What sort of math background can we assume the group to have?
Well, running a Turing machine for time can be simulated by a circuit of size , so in terms of efficiency it's much closer to "doing search" than to "memorizing the output of search".
Why do you think my counterexample doesn't have internal search? In my counterexample, the circuit is simulating the behavior of another agent, which presumably is doing search, so the circuit is also doing search.
I gave a talk on Dialogic Reinforcement Learning in the AI Safety Discussion Day, and there is a recording.
On second thought, that's not a big deal: we can fix it by interspersing random bits in the input. This way, the transformer would see a history that includes and the random bits used to produce it (which encode ). More generally, such a setup can simulate any randomized RNN.
No, because the RNN is not deterministic. In order to simulate the RNN, the transformer would have to do exponentially many "Monte Carlo" iterations until it produces the right history.
I'm assuming that either architecture can use a source of random bits.
The transformer produces one bit at a time, computing every bit from the history so far. It doesn't have any state except for the history. At some stage of the game the history consists of only. At this stage the transformer would have to compute from in order to win. It doesn't have any activations to go on besides those that can be produced from .
There's no real difference between a history, and a recurrence.
That's true for unbounded agents, but false for realistic (bounded) agents. Consider the following two-player zero-sum game:
Player A secretly writes some , then player B says some and finally player B says some . Player A gets reward unless where is a fixed one-way function. If , player A gets a reward in which is the fraction of bits and have in common.
The optimal strategy for player A is producing a random sequence. The optimal strategy for player B is choosing a random , computing , outputting and then outputting . The latter is something that an RNN can implement (by storing in its internal state) but a stateless architecture like a transformer cannot. A stateless algorithm would have to recover from , which is computationally infeasible.
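A toy rendition of player B's side of this game, with SHA-256 standing in for the one-way function (the class and function names are illustrative):

```python
# The "RNN-like" player keeps x in its state between turns; a stateless
# player would have to recover x from y = f(x), which is computationally
# infeasible for a real one-way function.

import hashlib
import secrets

def f(x: bytes) -> bytes:
    """One-way function stand-in: SHA-256."""
    return hashlib.sha256(x).digest()

class StatefulPlayerB:
    """Chooses x, outputs y = f(x), then outputs the remembered x."""
    def __init__(self):
        self.x = secrets.token_bytes(16)   # internal state, like an RNN's
    def first_move(self) -> bytes:
        return f(self.x)
    def second_move(self) -> bytes:
        return self.x                      # no inversion needed

b = StatefulPlayerB()
y = b.first_move()
x = b.second_move()
# Player B's second move really is a preimage of its first move.
assert f(x) == y
```

A stateless player sees only the history `(y,)` at its second turn, so winning would require inverting `f` — exactly the asymmetry the argument above turns on.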
We can rephrase your question as follows: "Can we increase the probability of finding an error in the known laws of physics by performing an experiment with a simple property that never happened before, either naturally or artificially?" And the answer is: yes! This is actually what experimental physicists do all the time: perform experiments that probe novel circumstances where it is plausible (Occam-razor-wise) that new physics will be discovered.
As to magical rituals, sufficiently advanced technology is indistinguishable from magic :)
Thanks! Just used it to make the payment.
A variant of Dialogic RL with improved corrigibility. Suppose that the AI's prior allows a small probability for "universe W", whose semantics are, roughly speaking, "all my assumptions are wrong, need to shut down immediately". In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down has much higher utility than anything else. Moreover, we add to the prior the assumption that the formal question "W?" is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering "yes" if asked "W?" at any uncorrupt point in the future, the AI will shut down immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but will also ensure that the AI won't come to believe too many futures are corrupt and thereby avoid the imperative to shut down in response to a confirmation of W.
Now, this won't help if the user only resolves to confirm W after something catastrophic already occurred, such as the AI releasing malign subagents into the wild. But, something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes in the AI on eir own initiative, which can always be too late. This method doesn't ensure safety in itself, just hardens a system that is supposed to be already close to safe.
It would be nice if we could replace "shutdown" with "undo everything you did and then shut down", but that gets us into thorny specification issues. Perhaps it's possible to tackle those issues via one of the approaches to "low impact".
Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...} At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!
This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to reason). Of course, with bounded algorithms the convergence will be faster, something like the inverse bounded-busy-beaver function, but still very slow. Any learning algorithm with an inductive bias towards simplicity will have generalization failures when coming across the fault lines that carve reality at the joints, at every new level of the domain hierarchy.
This has an important consequence for alignment: in order to stand a chance, any alignment protocol must be fully online, meaning that whatever data sources it uses, those data sources must always stay in the loop, so that the algorithm can query the data source whenever it encounters a fault line. Theoretically, the data source can be disconnected from the loop at the point when it's fully "uploaded": when the algorithm has unambiguously converged to a detailed, accurate model of the data source. But in practice that convergence will be very slow, and it's very hard to know that it has already occurred: maybe the model seems good for now but will fail at the next fault line. Moreover, convergence might literally never occur if the machine just doesn't have the computational resources to contain such an upload (which doesn't mean it doesn't have the computational resources to be transformative!)^{[1]}
This is also a reason for pessimism regarding AI outcomes. AI scientists working through trial and error will see generalization failures become more and more rare, with longer and longer stretches of stable function in between. This creates the appearance of increasing robustness. But in reality, robustness increases very slowly. We might reach a stable stretch between "subhuman" and "far superhuman", and the next fault line will be the end.
In the Solomonoff analogy, we can imagine the real data source as a short but prohibitively expensive program, and the learned model of the data source as an affordable but infinitely long program: as time progresses, more and more bits of this program will be learned, but there will always be bits that are still unknown. Of course, any prohibitively expensive program can be made affordable by running it much slower than real time, which is something that Turing RL can exploit, but at some point this becomes impractical. ↩︎
For examples of what a formalization of alignment could look like, see this and this.
Why would injury prevent cryopreservation, unless it's head injury?
Sounds great! Only one complaint: international wire is an enormous pain. It would be way better if it was possible to pay by PayPal or something.
...if both players give the same answer there is no training signal.
Why? If both players give the same answer, this only means their reward on this round is out of . But, there's no reason the learning algorithm should be satisfied with this result, rather than continuing to explore strategies that might produce positive reward. However, it is true that in this case there is no incentive to poke holes in the opponent's answer, so maybe they get less feedback from such a debate than from debates with different answers.
But, now that I think about it, the issue with biased judgement can surface even in a symmetric debate. As the AI converges towards giving good answers, the judge might get used to assigning high scores and stop scrutinizing the arguments. In a yes/no debate we don't have this problem because the judge doesn't know a priori which side is right. Scott's suggestion to use different questions is interesting but doesn't solve the biased judgement issue, I think.
How about the following variant of the "secret asymmetry" approach. We have 4 AIs: agents A1, A2, B1 and B2. In the beginning of each debate a coin is flipped and the result is visible to A1 and A2 but not to B1, B2 or the judge. This coin marks one of {A1, A2} as the "proponent" and the other as the "opponent". On the first round A1 and A2 each generate an answer to the question, and don't participate anymore. On the following rounds, B1 and B2 have a debate about the answers. In the end, the judge assigns probability to A1's answer and probability to A2's answer. The rewards work as follows:

If A1 is the proponent, it gets reward , and if it is the opponent, it gets reward .

If A2 is the proponent, it gets reward , and if it is the opponent, it gets reward .

B1 gets reward .

B2 gets reward .
If we assume B1 and B2 have access to each other's internals, but not to the internals of A1 and A2, then they cannot prove to the judge which side is the proponent, so ostensibly the judge remains unbiased.
Ah, well, that does make more sense for the case of binary (or even n-ary) questions. The version in the original paper was free-response.
I'm still confused. Suppose the answers are free-form, and in the end the judge selects the answer to which ey assign a higher probability of truthfulness. If it's a very close call (for example both answers are literally the same), ey flip a coin. Then, in equilibrium both agents should answer honestly, not so?
Another, possibly more elegant variant: The judge states eir subjective probability $p_1$ that the first AI's answer is correct, and eir subjective probability $p_2$ that the second AI's answer is correct. AI 1 gets reward $p_1 - p_2$ and AI 2 gets reward $p_2 - p_1$.
I've usually seen the truthful equilibrium (ie, the desired result of training) described as one where the first player always gives the real answer, and the second player has to lie.
That seems weird, why would we do that? I always thought of it as: there is a yes/no question, agent 1 is arguing for "yes", agent 2 is arguing for "no".
However, the problem is that debate is supposed to allow justification trees which are larger than can possibly be explained to the human, but which make sense to a human at every step.
I didn't realize you make this assumption. I agree that it makes things much more iffy (I'm somewhat skeptical about "factored cognition"). But, debate can be useful without this assumption also. We can imagine an AI answering questions for which the answer can be fully explained to a human, but it's still superintelligent because it comes up with those answers much faster than a human or even all of humanity put together. In this case, I would still worry that, scaled up indefinitely, it can lead to AIs hacking humans in weird ways. But, plausibly there is a middle region (that we can access by quantilization?) where they are strong enough to be superhuman and to lie in "conventional" ways (which would be countered by the debate opponent), but too weak for weird hacking. And, in any case, combining this idea with other alignment mechanisms can lead to something useful (e.g. I suggested using it in Dialogic RL).
I think the judge should state eir honest opinion. To solve the problem of sparse feedback in the early phase, give the system access to more data than just win/lose from its own games. You can initialize it by training on human debates. Or, you can give it other input channels that will allow it to gradually build a sophisticated model of the world that includes the judge's answer as a special case. For example, if you monitor humans for a long time you can start predicting human behavior, and the judge's ruling is an instance of that.
It's really disappointing there were only 61 respondents, compared to e.g. the 2016 survey with over 3000 respondents.
Regarding externalities, I think the correct way to calculate a Pigovian tax is as the value the rest of society will lose, provided that they can react to the existence of the externality. So, in Friedman's example, the actual damage of pollution (and the tax to be levied on the steel mill) is only $50,000, because of the possibility of shifting land use. Of course this does make it harder to evaluate externalities in practice, but the principle seems solid at least.
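As a toy illustration of this principle (with made-up numbers, not Friedman's actual figures): the tax should reflect the cheapest response available to society, not the damage under the status quo.

```python
# Hypothetical numbers for illustration only.
# If the neighbors keep using the land for housing, pollution from the
# steel mill causes $200,000 of damage. But they could shift the land
# to a pollution-tolerant use and lose only $50,000 of land value.
damage_without_reaction = 200_000
loss_under_best_reaction = 50_000

# The Pigovian tax equals the loss society actually suffers given that
# it can react optimally to the externality: the cheaper of the two.
pigovian_tax = min(damage_without_reaction, loss_under_best_reaction)
print(pigovian_tax)  # 50000
```

Taxing the mill the full $200,000 would make it internalize damage that society, reacting sensibly, would never actually bear.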
Imagine typing the following meta-question into GPT-4, a revolutionary new 20-trillion-parameter language model released in 2021:
"I asked the superintelligence how to cure cancer. The superintelligence responded __"
How likely are we to get an actual cure for cancer, complete with manufacturing blueprints?
...The difference between the two lies in a fine semantic distinction: whether GPT thinks the conversation is a human imitating a superintelligence, or the actual words of a superintelligence. Arguably, since it only has training samples of the former, it will do the former. Yet that's not what it did with numbers: it learnt the underlying principle, and extrapolated to tasks it had never seen.
What GPT is actually trying to do is predict the continuation of random texts found on the Internet. So, if you let it continue "24+51=", what it does is answer the question "Suppose a random text on the Internet contained the string '24+51='. What do I expect to come next?" In this case, it seems fairly reasonable to expect the correct answer. More so if this is preceded by a number of correct exercises in arithmetic (otherwise, maybe it's e.g. one of those puzzles in which symbols of arithmetic are used to denote something different).
On the other hand, your text about curing cancer is extremely unlikely to be generated by an actual superintelligence. If you told me that you found this text on the Internet, I would bet against the continuation being an actual cure for cancer. I expect any version of GPT which is as smart as me or more to reason similarly (except for complex reasons to do with subagents and acausal bargaining which are beside the point here), and any version of GPT that is less smart than me to be unable to cure cancer (roughly speaking: intelligence is not really one-dimensional).
It seems more likely to get an actual cure for cancer if your initial text is a realistic imitation of something like, an academic paper describing a novel cure for cancer. Or, a paper in AI describing a superintelligence that can cure cancer.
This idea is certainly not new, for example in an essay about TDT from 2009, Yudkowsky wrote:
Some concluding chiding of those philosophers who blithely decided that the "rational" course of action systematically loses... And celebrating of the fact that rationalists can cooperate with each other, vote in elections, and do many other nice things that philosophers have claimed they can't...
(emphasis mine)
The relevance of TDT/UDT/FDT to voting surfaced in discussions many times, but possibly nobody wrote a detailed essay on the subject.
Where can I find those events if I want to be a non-speaker participant?
I'm glad it worked :) It's not that surprising given that pain is known to be susceptible to the placebo effect. I would link the SSC post, but, alas...
Well, HRAD certainly relates to my own research programme. Embedded agency seems important since human values are probably "embedded" to some extent, counterfactuals are important for translating knowledge from the user's subjective vantage point to the AI's subjective vantage point, and reflection is important if it's required for high capability (as Turing RL suggests). I do agree that having a high-level plan for solving the problem is important to focus the research in the right directions.
There are "shared" phobias, and common types of paranoia. There are also beliefs many people share that have little to do with reality, such as conspiracy theories or UFOs. Of course in the latter case they share those beliefs because they transmitted them to each other, but the mystics are also influenced by each other.