Posts

Counterfactual Mechanism Networks 2024-01-30T20:30:00.954Z
To Boldly Code 2024-01-26T18:25:59.525Z
Incorporating Mechanism Design Into Decision Theory 2024-01-26T18:25:40.373Z
Reframing Acausal Trolling as Acausal Patronage 2024-01-23T03:04:53.706Z
Incorporating Justice Theory into Decision Theory 2024-01-21T19:17:11.653Z
Legibility Makes Logical Line-Of-Sight Transitive 2024-01-19T23:39:47.213Z
Logical Line-Of-Sight Makes Games Sequential or Loopy 2024-01-19T04:05:44.782Z
In Strategic Time, Open-Source Games Are Loopy 2024-01-18T00:08:40.909Z
A Benchmark for Decision Theories 2024-01-11T18:54:40.300Z
Using Threats to Achieve Socially Optimal Outcomes 2024-01-04T23:30:54.615Z
Best-Responding Is Not Always the Best Response 2024-01-04T23:30:48.400Z
Safety Data Sheets for Optimization Processes 2024-01-04T23:30:36.510Z
The Gears of Argmax 2024-01-04T23:30:30.339Z
When Can Optimization Be Done Safely? 2023-12-30T01:24:30.234Z
Optimization Markets 2023-12-30T01:24:01.777Z
Social Choice Theory and Logical Handshakes 2023-12-29T03:49:53.576Z
Distributed Strategic Epistemology 2023-12-28T22:12:46.299Z
Building Trust in Strategic Settings 2023-12-28T22:12:24.024Z
An Ontology for Strategic Epistemology 2023-12-28T22:11:56.510Z
How Emergency Medicine Solves the Alignment Problem 2023-12-26T05:24:35.579Z
A Decision Theory Can Be Rational or Computable, but Not Both 2023-12-21T21:02:45.366Z

Comments

Comment by StrivingForLegibility on Updatelessness doesn't solve most problems · 2024-03-05T22:46:03.342Z · LW · GW

The problem remains though: you make the ex ante call about which information to "decision-relevantly update on", and this can be a wrong call, and this creates commitment races, etc.

My understanding is that commitment races only occur in cases where "information about the commitments made by other agents" has negative value for all relevant agents. (All agents are racing to commit before learning more, which might scare them away from making such a commitment.)

It seems like updateless agents should not find themselves in commitment races.

My impression is that we don't have a satisfactory extension of UDT to multi-agent interactions. But I suspect that the updateless response to observing "your counterpart has committed to going Straight" will look less like "Swerve, since that's the best response" and more like "go Straight with enough probability that your counterpart wishes they'd coordinated with you rather than trying to bully you."

Offering to coordinate on socially optimal outcomes, and being willing to pay costs to discourage bullying, seems like a generalizable way for smart agents to achieve good outcomes.

Comment by StrivingForLegibility on Updatelessness doesn't solve most problems · 2024-03-05T22:08:17.455Z · LW · GW

Got it, thank you!

It seems like trapped priors and commitment races are exactly the sort of cognitive dysfunction that updatelessness would solve in generality. 

My understanding is that trapped priors are a symptom of a dysfunctional epistemology, which over-weights prior beliefs when updating on new observations. This results in an agent getting stuck, or even getting more and more confident in their initial position, regardless of what observations they actually make. 

Similarly, commitment races are the result of dysfunctional reasoning that regards accurate information about other agents as hazardous. It seems like the consensus is that updatelessness is the general solution to infohazards.

My current model of an "updateless decision procedure", approximated on a real computer, is something like "a policy which is continuously optimized, as an agent has more time to think, and the agent always acts according to the best policy it's found so far." And I like the model you use in your report, where an ecosystem of participants collectively optimize a data structure used to make decisions.

Since updateless agents use a fixed optimization criterion for evaluating policies, we can use something like an optimization market to optimize an agent's policy. It seems easy to code up traders that identify "policies produced by (approximations of) Bayesian reasoning", which I suspect won't be subject to trapped priors.
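
Concretely, a minimal sketch of the model I have in mind (every name and type here is made up): an anytime loop that scores candidate policies against a fixed ex ante criterion and always acts on the best one found so far.

```typescript
// A sketch of "continuously optimize a policy, always act on the best found so far."
// All names here (Observation, Action, Policy, PolicyScorer) are hypothetical.

type Observation = string;
type Action = string;
type Policy = (obs: Observation) => Action;

// Fixed ex ante criterion: the agent's evaluation of a whole policy under its prior.
// In an optimization-market setup, traders would submit candidate policies and be
// rewarded according to this same fixed score.
type PolicyScorer = (policy: Policy) => number;

class AnytimePolicyOptimizer {
  private best: { policy: Policy; score: number };

  constructor(initial: Policy, private score: PolicyScorer) {
    this.best = { policy: initial, score: score(initial) };
  }

  // Called whenever a trader / search process proposes a candidate policy.
  consider(candidate: Policy): void {
    const s = this.score(candidate);
    if (s > this.best.score) {
      this.best = { policy: candidate, score: s };
    }
  }

  // The agent always acts according to the best policy it's found so far.
  act(obs: Observation): Action {
    return this.best.policy(obs);
  }
}
```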

So updateless agents seem like they should be able to do at least as well as updateful agents, because they can identify updateful policies and use those if they seem optimal. But they can also use different reasoning to identify policies like "pay Paul Ekman to drive you out of the desert", and automatically adopt those when they lead to higher EV than updateful policies.

I suspect that the generalization of updatelessness to multi-agent scenarios will involve optimizing over the joint policy space, using a social choice theory to score joint policies. If agents agree at the meta level about "how conflicts of interest should be resolved", then that seems like a plausible route for them to coordinate on socially optimal joint policies.

I think this approach also avoids the sky-rocketing complexity problem, if I understand the problem you're pointing to. (I think the problem you're pointing to involves trying to best-respond to another agent's cognition, which gets more difficult as that agent becomes more complicated.)

Comment by StrivingForLegibility on Updatelessness doesn't solve most problems · 2024-02-12T19:28:34.875Z · LW · GW

The distinction between "solving the problem for our prior" and "solving the problem for all priors" definitely helps! Thank you!

I want to make sure I understand the way you're using the term updateless, in cases where the optimal policy involves correlating actions with observations. Like pushing a red button upon seeing a red light, but pushing a blue button upon seeing a blue light. It seems like (See Red -> Push Red, See Blue -> Push Blue) is the policy that CDT, EDT, and UDT would all implement.
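
In code terms (made-up types), that policy is just a function from observations to actions, and it's the same function all three theories output here:

```typescript
type Light = "Red" | "Blue";
type Button = "PushRed" | "PushBlue";

// (See Red -> Push Red, See Blue -> Push Blue): the policy that CDT, EDT,
// and UDT would all implement in this example.
const policy = (seen: Light): Button => (seen === "Red" ? "PushRed" : "PushBlue");
```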

In the way that I understand the terms, CDT and EDT are updateful procedures, and UDT is updateless. And all three are able to use information available to them. It's just that an updateless decision procedure always handles information in ways that are endorsed a priori. (True information can degrade the performance of updateful decision theories, but updatelessness implies infohazard immunity.)

Is this consistent with the way you're describing decision-making procedures as updateful and updateless?

 

It also seems like if an agent is regarding some information as hazardous, that agent isn't being properly updateless with respect to that information. In particular, if it finds that it's afraid to learn true information about other agents (such as their inclinations and pre-commitments), it already knows that it will mishandle that information upon learning it. And if it were properly updateless, it would handle that information properly.

It seems like we can use that "flinching away from true information" as a signal that we'd like to change the way our future self will handle learning that information. If our software systems ever notice themselves calculating a negative value of information for an observation (empirical or logical), the details of that calculation will reveal at least one counterfactual branch where they're mishandling that information. It seems like we should always be able to automatically patch that part of our policy, possibly using a commitment that binds our future self.

In the worst case, we should always be able to do what our ignorant self would have done, so information should never hurt us.
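
For a single decision-maker this is just the standard argument that free information has non-negative value (written in notation of my own choosing): let $a^{*}$ be whatever the ignorant self would have chosen. Since the informed self can always still choose $a^{*}$,

$$\mathbb{E}_{o}\!\left[\,\max_{a} \mathbb{E}[U \mid a, o]\,\right] \;\ge\; \mathbb{E}_{o}\!\left[\,\mathbb{E}[U \mid a^{*}, o]\,\right] \;=\; \mathbb{E}[U \mid a^{*}] \;=\; \max_{a}\, \mathbb{E}[U \mid a].$$

The interesting failures are the strategic ones, where the problem is how the information gets handled rather than having it; the inequality is just the sense in which "do what my ignorant self would have done" is always available as a floor.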

Comment by StrivingForLegibility on Updatelessness doesn't solve most problems · 2024-02-12T18:18:29.877Z · LW · GW

Got it, I think I understand better the problem you're trying to solve! It's not just being able to design a particular software system and give it good priors, it's also finding a framework that's robust to our initial choice of priors.

Is it possible for all possible priors to converge on optimal behavior, even given unlimited observations? I'm thinking of Yudkowsky's example of the anti-Occamian and anti-Laplacian priors: the more observations an anti-Laplacian agent makes, the further its beliefs go from the truth.

I'm also surprised that dynamic stability leads to suboptimal outcomes that are predictable in advance. Intuitively, it seems like this should never happen.

Comment by StrivingForLegibility on Updatelessness doesn't solve most problems · 2024-02-09T08:07:51.264Z · LW · GW

It sounds like we already mostly agree!

I agree with Caspar's point in the article you linked: the choice of metric determines which decision theories score highly on it. The metric that I think points towards "going Straight sometimes, even after observing that your counterpart has pre-committed to always going Straight" is a strategic one. If Alice and Bob are writing programs to play open-source Chicken on their behalf, then there's a program equilibrium where:

  • Both programs first try to perform a logical handshake, coordinating on a socially optimal joint policy.
    • This only succeeds if they have compatible notions of social optimality.
  • As a fallback, Alice's program adopts a policy which
    • Caps Bob's expected payoff at what Bob would have received under Alice's notion of social optimality
      • Minus an extra penalty, to give Bob an incentive gradient to climb towards what Alice sees as the socially optimal joint policy
    • Otherwise maximizes Alice's payoff, given that incentive-shaping constraint
  • Bob's fallback operates symmetrically, with respect to his notion of social optimality.

The motivating principle is to treat one's choice of decision theory as itself strategic. If Alice chooses a decision theory which never goes Straight, after making the logical observation that Bob's decision theory always goes Straight, then Bob's best response is to pick a decision theory that always goes Straight and make that as obvious as possible to Alice's decision theory.

Whereas if Alice designs her decision theory to grant Bob the highest payoff when his decision theory legibly outputs Bob's part of  (what Alice sees as a socially optimal joint policy), then Bob's best response is to pick a decision theory that outputs Bob's part of  and make that as obvious as possible to Alice's decision theory.
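
Here's a minimal sketch of that kind of incentive-shaping fallback for Chicken. The payoff numbers, the 0.25 penalty, and the grid search are all assumptions of mine for illustration, and the handshake step is omitted: the fallback just takes a prediction of Bob's strategy as input.

```typescript
// Chicken with made-up payoffs [Alice, Bob]; "Straight vs Straight" is the crash.
const payoff = (aliceStraight: boolean, bobStraight: boolean): [number, number] => {
  if (aliceStraight && bobStraight) return [-2, -2];
  if (aliceStraight && !bobStraight) return [2, 1];
  if (!aliceStraight && bobStraight) return [1, 2];
  return [0, 0];
};

// Expected payoffs when Alice goes Straight with probability pA and Bob with pB.
const expected = (pA: number, pB: number): [number, number] => {
  let a = 0, b = 0;
  for (const aliceStraight of [true, false]) {
    for (const bobStraight of [true, false]) {
      const prob = (aliceStraight ? pA : 1 - pA) * (bobStraight ? pB : 1 - pB);
      const [ua, ub] = payoff(aliceStraight, bobStraight);
      a += prob * ua;
      b += prob * ub;
    }
  }
  return [a, b];
};

// Under Alice's notion of social optimality (a 50/50 correlated mix of
// (Straight, Swerve) and (Swerve, Straight)), Bob would get 1.5.
const bobShareUnderAliceOptimum = 1.5;
const penalty = 0.25; // the incentive-gradient penalty; the exact size is arbitrary here

// Fallback: given a prediction of Bob's strategy, pick Alice's mixture so that
// Bob's expected payoff is capped at his share under Alice's optimum minus the
// penalty, and Alice does as well for herself as possible subject to that cap.
function aliceFallback(predictedBobStraightProb: number): number {
  const cap = bobShareUnderAliceOptimum - penalty;
  let bestP = 1;
  let bestEU = -Infinity;
  for (let p = 0; p <= 1.000001; p += 0.001) {
    const [euAlice, euBob] = expected(p, predictedBobStraightProb);
    if (euBob <= cap && euAlice > bestEU) {
      bestEU = euAlice;
      bestP = p;
    }
  }
  return bestP; // Alice's probability of going Straight
}

// Against a Bob who has committed to always going Straight:
console.log(aliceFallback(1)); // ~0.19: enough Straight that Bob's EU drops below 1.25
```

Against a Bob who has committed to always going Straight, this fallback goes Straight roughly 19% of the time: just enough that Bob's expected payoff falls below the 1.5 he'd have gotten by coordinating on Alice's 50/50 split.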

It seems like one general recipe for avoiding commitment races would be something like:

  • Design your decision theory so that no information is hazardous to it
    • We should never be willing to pay in order to not know certain implications of our beliefs, or true information about the world
  • Design your decision theory so that it is not infohazardous to sensible decision theories
    • Our counterparts should generally expect to benefit from reasoning more about us, because we legibly are trying to coordinate on good outcomes and we grant the highest payoffs to those that coordinate with us
    • If infohazard resistance is straightforward, then our counterpart should hopefully have that reflected in their prior.
  • Do all the reasoning you want about your counterpart's decision theory
    • It's fine to learn that your counterpart has pre-committed to going Straight. What's true is already so. Learning this doesn't force you to Swerve.
    • Plus, things might not be so bad! You might be a hypothetical inside your counterpart's mind, considering how you would react to learning that they've pre-committed to going Straight.
      • Your actions in this scenario can determine whether it becomes factual or counterfactual. Being willing to crash into bullies can discourage them from trying to bully you into Swerving in the first place.
    • You might also discover good news about your counterpart, like that they're also implementing your decision theory.
      • If this were bad news, like for commitment-racers, we'd want to rethink our decision theory.
Comment by StrivingForLegibility on Updatelessness doesn't solve most problems · 2024-02-09T07:42:26.214Z · LW · GW

So we seem to face a fundamental trade-off between the information benefits of learning (updating) and the strategic benefits of updatelessness. If I learn the digit, I will better navigate some situations which require this information, but I will lose the strategic power of coordinating with my counterfactual self, which is necessary in other situations.

 

It seems like we should be able to design software systems that are immune to any infohazard, including logical infohazards.

  • If it's helpful to act on a piece of information you know, act on it.
  • If it's not helpful to act on a piece of information you know, act as if you didn't know it.

Ideally, we could just prove that "Decision Theory X never calculates a negative value of information". But if needed, we could explicitly design a cognitive architecture with infohazard mitigation in mind. Some options include:

  • An "ignore this information in this situation" flag
    • Upon noticing "this information would be detrimental to act on in this situation", we could decide to act as if we didn't know it, in that situation.
    • (I think this is one of the designs you mentioned in footnote 4.)
  • Cognitive sandboxes
    • Spin up some software in a sandbox to do your thinking for you.
    • The software should only return logical information that is true and useful in your current situation.
    • If it notices any hazardous information, it simply doesn't return it to you.
    • Upon noticing that a train of thought doesn't lead to any true and useful information, don't think about why that is and move on.

I agree with your point in footnote 4, that the hard part is knowing when to ignore information. Upon noticing that it would be helpful to ignore something, the actual ignoring seems easy.
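
A sketch of the flag-based design, with made-up types: the agent compares acting on the observation against acting as its ignorant self would have, and only uses the observation when that comparison comes out (weakly) in its favor.

```typescript
// Made-up types for a sketch of the "ignore this information in this situation" flag.
type Situation = string;
type Obs = string;
type Act = string;

interface GuardedAgent {
  ignorantPolicy: (s: Situation) => Act;             // what it would do without the observation
  informedPolicy: (s: Situation, o: Obs) => Act;     // what it would do acting on the observation
  expectedUtility: (s: Situation, a: Act) => number; // its own ex ante evaluation of an action
}

// Act on the observation only when that's (weakly) better than ignoring it.
// In the worst case the agent does what its ignorant self would have done,
// so by its own lights the observation never makes things worse.
function act(agent: GuardedAgent, s: Situation, o: Obs): Act {
  const informed = agent.informedPolicy(s, o);
  const ignorant = agent.ignorantPolicy(s);
  return agent.expectedUtility(s, informed) >= agent.expectedUtility(s, ignorant)
    ? informed
    : ignorant;
}
```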

Comment by StrivingForLegibility on Updatelessness doesn't solve most problems · 2024-02-09T01:26:38.681Z · LW · GW

To play that back: it sounds like "thinking more about what other agents will do" can be infohazardous to some decision theories, in the sense that they sometimes handle that sort of logical information in a way that produces worse results than if they didn't have that logical information in the first place. They can sometimes regret thinking more.

It seems like it should always be possible to structure our software systems so that this doesn't happen. I think this comes at the cost of not always best-responding to other agents' policies.

In the example of Chicken, I think that looks like first trying to coordinate on a correlated strategy, like a 50/50 mix of (Straight, Swerve) and (Swerve, Straight). (First try to coordinate on a socially optimal joint policy.)

Supposing that failed, our software system could attempt to troubleshoot why, and discover that their counterpart has simply pre-committed to always going Straight. Upon learning that logical fact, I don't think the best response is to best-respond, i.e. Swerve. If we're playing True Chicken, it seems like in this case we should go Straight with enough probability that our counterpart regrets not thinking more and coordinating with us.

Comment by StrivingForLegibility on Updatelessness doesn't solve most problems · 2024-02-08T23:29:16.113Z · LW · GW

It’s certainly not looking very likely (> 80%) that ... in causal interactions [most superintelligences] can easily and “fresh-out-of-the-box” coordinate on Pareto optimality (like performing logical or value handshakes) without falling into commitment races.

 

What are some obstacles to superintelligences performing effective logical handshakes? Or equivalently, what are some necessary conditions that seem difficult to bring about, even for very smart software systems?

(My understanding of the term "logical handshake" is as a generalization of the technique from the Robust Cooperation paper. Something like "I have a model of the other relevant decision-makers, and I will enact my part of the joint policy  if I'm sufficiently confident that they'll all enact their part of ." Is that the sort of decision-procedure that seems likely to fall into commitment races?)

Comment by StrivingForLegibility on Incorporating Mechanism Design Into Decision Theory · 2024-02-02T04:45:57.837Z · LW · GW

Thank you! I'm interested in checking out earlier chapters to make sure I understand the notation, but here's my current understanding:

There are 7 axioms that go into Joyce's representation theorem, and none of them seem to put any constraints on the set of actions available to the agent. So we should be able to ask a Joyce-rational agent to choose a policy for a game.

My impression of the representation theorem is that a formula like  can represent a variety of decision theories. Including ones like CDT which are dynamically inconsistent: they have a well-defined answer to "what do you think is the best policy", and it's not necessarily consistent with their answer to "what are you actually going to do?"

So it seems like the axioms are consistent with policy optimization, and they're also consistent with action optimization. We can ask a decision theory to optimize a policy using an analogous expression: .

It seems like we should be able to get a lot of leverage by imposing a consistency requirement that these two expressions line up. It shouldn't matter whether we optimize over actions or policies, the actions taken should be the same.
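
Spelled out in generic notation of my own (not Joyce's): if $\pi^{*}$ is the policy the agent endorses when optimizing over policies, and $a^{*}(h)$ is the action it picks when optimizing over actions after history $h$, the consistency requirement is that they prescribe the same behavior everywhere the policy actually reaches:

$$\pi^{*} \in \arg\max_{\pi}\, \mathbb{E}\!\left[U \mid \pi\right], \qquad a^{*}(h) \in \arg\max_{a}\, \mathbb{E}\!\left[U \mid a, h\right], \qquad \pi^{*}(h) = a^{*}(h) \text{ for every reachable } h.$$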

I don't expect that fully specifies how to calculate the counterfactual data structures  and , even with Joyce's other 7 axioms. But the first 7 didn't rule out dynamic or counterfactual inconsistency, and this should at least narrow our search down to decision theories that are able to coordinate with themselves at other points in the game tree.

Comment by StrivingForLegibility on Incorporating Mechanism Design Into Decision Theory · 2024-02-02T00:36:04.555Z · LW · GW

Totally! The ecosystem I think you're referring to is all of the programs which, when playing Chicken with each other, manage to play a correlated strategy somewhere on the Pareto frontier between (1,2) and (2,1).

Games like Chicken are actually what motivated me to think in terms of "collaborating to build mechanisms to reshape incentives." If both players choose their mixed strategy separately, there's an equilibrium where they independently mix () between Straight and Swerve respectively. But sometimes this leads to (Straight, Straight) or (Swerve, Swerve), leaving both players with an expected utility of  and wishing they could coordinate on Something Else Which Is Not That.

If they could coordinate to build a traffic light, they could correlate their actions and only mix between (Straight, Swerve) and (Swerve, Straight). A 50/50 mix of these two gives each player an expected utility of 1.5, which seems pretty fair in terms of the payoffs achievable in this game.

Anything that's mutually unpredictable and mutually observable can be used to correlate actions by different agents. Agents that can easily communicate can use cryptographic commitments to produce legibly fair correlated random signals.
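
For the communication case, a minimal commit-reveal sketch (using Node's built-in crypto; the details are my own simplification of the standard protocol): each player commits to a random bit with a salted hash, then reveals, and the XOR of the two bits is a shared signal that neither side could bias after seeing the other's commitment.

```typescript
import { createHash, randomBytes } from "crypto";

type Commitment = { hash: string };
type Reveal = { bit: 0 | 1; salt: string };

// Commit to a random bit by publishing a salted hash of it.
function commit(): { commitment: Commitment; reveal: Reveal } {
  const bit = (randomBytes(1)[0] & 1) as 0 | 1;
  const salt = randomBytes(16).toString("hex");
  const hash = createHash("sha256").update(`${bit}:${salt}`).digest("hex");
  return { commitment: { hash }, reveal: { bit, salt } };
}

// Check that a reveal matches the earlier commitment.
function verify(c: Commitment, r: Reveal): boolean {
  return createHash("sha256").update(`${r.bit}:${r.salt}`).digest("hex") === c.hash;
}

// Both players exchange commitments first, then reveals.
const alice = commit();
const bob = commit();
if (verify(alice.commitment, alice.reveal) && verify(bob.commitment, bob.reveal)) {
  const sharedBit = alice.reveal.bit ^ bob.reveal.bit; // 0 or 1, legibly fair
  // e.g. 0 => play (Straight, Swerve), 1 => play (Swerve, Straight)
  console.log(sharedBit === 0 ? "(Straight, Swerve)" : "(Swerve, Straight)");
}
```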

My impression is that being able to perform logical handshakes creates program equilibria that can be better than any correlated equilibrium. When the traffic light says the joint strategy should be (Straight, Swerve), the player told to Swerve has an incentive to actually Swerve rather than go Straight, assuming the other player is going to be playing their part of the correlated equilibrium. But the same trick doesn't work in the Prisoners' Dilemma: a traffic light announcing (Cooperate, Cooperate) doesn't give either player an incentive to actually play their part of that joint strategy. Whereas a logical handshake actually does reshape the players' incentives: they each know that if they deviate from Cooperation, their counterpart will too, and they both prefer (Cooperate, Cooperate) to (Defect, Defect).

I haven't found any results for the phrase "correlated program equilibrium", but cousin_it talks about the setup here:

AIs that have access to each other's code and common random bits can enforce any correlated play by using the quining trick from Re-formalizing PD. If they all agree beforehand that a certain outcome is "good and fair", the trick allows them to "mutually precommit" to this outcome without at all constraining their ability to aggressively play against those who didn't precommit. This leaves us with the problem of fairness.

This gives us the best of both worlds: the random bits can get us any distribution over joint strategies we want, and the logical handshake allows enforcement of that distribution so long as it's better than each player's BATNA. My impression is that it's not always obvious what each player's BATNA is, and in this sequence I recommend techniques like counterfactual mechanism networks to move the BATNA in directions that all players individually prefer and agree are fair.

But in the context of "delegating your decision to a computer program", one reasonable starting BATNA might be "what would all delegates do if they couldn't read each other's source code?" A reasonable decision theory wouldn't give in to inappropriate threats, and this removes the incentive for other decision theories to make them towards us in the first place. In the case of Chicken, the closed-source answer might be something like the mixed strategy we mentioned earlier: () mixture between Straight and Swerve.

Any logical negotiation needs to improve on this baseline. This can make it a lot easier for our decision theory to resist threats. Like in the next post, AliceBot can spin up an instance to negotiate with BobBot, and basically ignore the content of this negotiation. Negotiator AliceBot can credibly say to BobBot "look, regardless of what you threaten in this negotiation, take a look at my code. Implementer AliceBot won't implement any policy that's worse than the BATNA defined at that level." And this extends recursively throughout the network, like if they perform multiple rounds of negotiation.

Comment by StrivingForLegibility on To Boldly Code · 2024-01-28T02:00:32.864Z · LW · GW

I'd been thinking about "cleanness", but I think you're right that "being oriented to what we're even talking about" is more important. Thank you again for the advice!

Comment by StrivingForLegibility on To Boldly Code · 2024-01-27T00:43:12.758Z · LW · GW

Thank you! I started writing the previous post in this sequence and decided to break the example off into its own post. 

For anyone else looking for a TLDR: this is an example of how a network of counterfactual mechanisms can be used to make logical commitments for an arbitrary game.

Comment by StrivingForLegibility on Incorporating Justice Theory into Decision Theory · 2024-01-23T03:41:18.716Z · LW · GW

Totally! One of the most impressive results I've seen for one-shot games is the Robust Cooperation paper studying the open-source Prisoners' Dilemma, where each player delegates their decision to a program that will learn the exact source code of the other delegate at runtime. Even utterly selfish agents have an incentive to delegate their decision to a program like FairBot or PrudentBot.

I think the probabilistic element helps to preserve expected utility in cases where the demands from each negotiator exceed the total amount of resources being bargained over. If each precommits to demand $60 when splitting $100, deterministic rejection leads to ($0, $0) with 100% probability. Whereas probabilistic rejection calls for the evaluator to accept with probability slightly less than $40/$60 ≈ 66.67%. Accepting leads to a payoff of ($60, $40), for an expected joint utility of slightly less than ($40, $26.67).
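
Running those numbers (the small epsilon and the comparison against an accepted $50/$50 split are my own additions):

```typescript
// The numbers from the example above: a $100 pot and a $60 demand.
const pot = 100;
const demand = 60;
const remainder = pot - demand;                    // $40 left for the evaluator
const acceptProb = remainder / demand - 0.001;     // slightly less than 2/3

const expectedDemander = acceptProb * demand;      // just under $40
const expectedEvaluator = acceptProb * remainder;  // just under $26.67

console.log({ acceptProb, expectedDemander, expectedEvaluator });
// Demanding $60 nets the demander less in expectation than the $50 they'd
// have gotten from a fair demand, so exaggerating doesn't pay.
```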

I think there are also totally situations where the asymmetrical power dynamics you're talking about mean that one agent gets to dictate terms and the other gets what they get. Such as "Alice gets to unilaterally decide how $100 will be split, and Bob gets whatever Alice gives him." In the one-shot version of this with selfish players, Alice just takes the $100 and Bob gets $0. Any hope for getting a selfish Alice to do anything else is going to come from incentives beyond this one interaction.

Comment by StrivingForLegibility on Incorporating Justice Theory into Decision Theory · 2024-01-23T00:28:16.513Z · LW · GW

My point is there's a very tenuous jump from us making decisions to how/whether to enforce our preferences on others.

I think the big link I would point to is "politics/economics." The spherical cows in a vacuum model of a modern democracy might be something like "a bunch of agents with different goals, that use voting as a consensus-building and standardization mechanism to decide what rules they want enforced, and contribute resources towards the costs of that enforcement."

When it comes to notions of fairness, I think we agree that there is no single standard which applies in all domains in all places. I would frame it as an XKCD 927 situation, where there are multiple standards being applied in different jurisdictions, and within the same jurisdiction when it comes to different domains. (E.g. restitution vs damages.)

When it comes to a fungible resource like money or pie, I believe Yudkowsky's take is "a fair split is an equal split of the resource itself." One third each for three people deciding how to split a pie. There are well-defined extensions for different types of non-fungibility, and the type of "fairness" achieved seems to be domain-specific.

There are also results in game theory regarding "what does a good outcome for bargaining games look like?" These are also well-defined, and requiring different axioms leads to different bargaining solutions. My current favorite way of defining "fairness" for a bargaining game is the Kalai-Smorodinsky bargaining solution. At the meta-level I'm more confident about the attractive qualities of Yudkowsky's probabilistic rejection model. Which includes working pretty well even when participants disagree about how to define "fairness", and not giving anyone an incentive to exaggerate what they think is fair for them to receive. (Source might contain spoilers for Project Lawful but Yudkowsky describes the probabilistic rejection model here, and I discuss it more here.)
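
To make the Kalai-Smorodinsky solution concrete, here's a tiny sketch for a linear frontier with disagreement point (0, 0); the frontier and payoff bounds are made-up example inputs. KS picks the Pareto-optimal point at which both players receive the same fraction of their best feasible payoff.

```typescript
// Kalai-Smorodinsky for a linear frontier u1/a + u2/b = 1, disagreement point (0, 0).
// KS selects the Pareto-optimal point where both players get the same fraction of
// their ideal payoff (a for player 1, b for player 2). Example numbers are made up.

function kalaiSmorodinskyLinear(a: number, b: number): [number, number] {
  // Follow the ray from (0, 0) toward the ideal point (a, b) until it hits the
  // frontier: (t*a)/a + (t*b)/b = 2t = 1, so t = 1/2 for this family of frontiers.
  const t = 0.5;
  return [t * a, t * b];
}

// Player 1 can get up to $100 and player 2 up to $60 on the frontier u1/100 + u2/60 = 1.
console.log(kalaiSmorodinskyLinear(100, 60)); // [50, 30]: each gets 50% of their ideal
```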

Applying Yudkowsky's Algorithm to the labor scenario you described might look like having more fairness-oriented negotiations about "under what circumstances a worker can be fired", "what compensation fired workers can expect to receive", and "how much additional work can other workers be expected to perform without an increase in marginal compensation rate." That negotiation might happen at the level of individual workers, unions, labor regulations, or a convoluted patchwork of those and more. I think historically we've made significant gains in defining and enforcing standards for things like fair wages and adequate safety.

Comment by StrivingForLegibility on Incorporating Justice Theory into Decision Theory · 2024-01-22T18:12:40.737Z · LW · GW

This might be a miscommunication, I meant something like "you and I individually might agree that some cost-cutting measures are good and some cost-cutting measures are bad."

Agents probably also have an instrumental reason to coordinate on defining and enforcing standards for things like fair wages and adequate safety, where some agents might otherwise have an incentive to enrich themselves at the expense of others.

Comment by StrivingForLegibility on Incorporating Justice Theory into Decision Theory · 2024-01-22T18:00:40.379Z · LW · GW

Oops, when I heard about it I'd gotten the impression that this had been adopted by at least one AI firm, even a minor one, but I also can't find anything suggesting that's the case. Thank you!

It looks like OpenAI has split into a nonprofit organization and a "capped-profit" company.

The fundamental idea of OpenAI LP is that investors and employees can get a capped return if we succeed at our mission, which allows us to raise investment capital and attract employees with startup-like equity. But any returns beyond that amount—and if we are successful, we expect to generate orders of magnitude more value than we’d owe to people who invest in or work at OpenAI LP—are owned by the original OpenAI Nonprofit entity.

OpenAI Nonprofit could act like the Future of Life Institute's proposed Windfall Trust, and a binding commitment to do so would be a Windfall Clause. They could also do something else prosocial with those profits, consistent with their nonprofit status.

Comment by StrivingForLegibility on Incorporating Justice Theory into Decision Theory · 2024-01-22T16:37:30.828Z · LW · GW

I think we agree that in cases where competition is leading to good results, no change to the dynamics is called for.

We probably also agree on a lot of background value judgements like "when businesses become more competitive by spending less on things no one wants, like waste or pollution, that's great!" And "when businesses become more competitive by spending less on things people want, like fair wages or adequate safety, that's not great and intervention is called for."

One case where we might literally want to distribute resources from the makers of a valuable product, to their competitors and society at large, is the development of Artificial General Intelligence (AGI). One of the big causes for concern here is that the natural dynamics might be winner-take-all, leading to an arms race that sacrifices spending on safety in favor of spending on increased capabilities or an earlier launch date.

If instead all AGI developers believed that the gains from AGI development would be spread out much more evenly, this might help to divert spending away from increasing capabilities and deploying as soon as possible, and towards making sure that deployment is done safely. Many AI firms have already voluntarily signed Windfall Clauses, committing to share significant fractions of the wealth generated by successful AGI development.

EDIT: At the time of writing, it looks like Windfall Clauses have been advocated for but not adopted. Thank you Richard_Kennaway for the correction!

Comment by StrivingForLegibility on Incorporating Justice Theory into Decision Theory · 2024-01-22T15:51:05.676Z · LW · GW

For games without these mechanisms, the rational outcomes don't end up that pleasant.  Except sometimes, with players who have extra-rational motives.

I think we agree that if a selfish agent needs to be forced to not treat others poorly, in the absence of such enforcement they will treat others poorly.

It also seems like in many cases, selfish agents have an incentive to create exactly those mechanisms ensuring good outcomes for everyone, because it leads to good outcomes for them in particular. A nation-state comprised entirely of very selfish people would look a lot different from any modern country, but they face the same instrumental reasons to pool resources to enforce laws. The more inclined their populace is towards mistreatment in the absence of enforcement, the higher those enforcement costs need to be in order to achieve the same level of good treatment.

I also think "fairness" is a Schelling point that even selfish agents can agree to coordinate around, in a way that they could never be aligned on "maximizing Zaire's utility in particular." They don't need to value fairness directly to agree that "an equal split of resources is the only compromise we're all going to agree on during this negotiation."

So I think my optimism comes from at least two places:

  • Even utterly selfish agents still have an incentive to create mechanisms enforcing good outcomes for everyone.
  • People have at least some altruism, and are willing to pay costs to prevent mistreatment of others in many cases.
Comment by StrivingForLegibility on Incorporating Justice Theory into Decision Theory · 2024-01-22T00:03:22.231Z · LW · GW

That sounds reasonable to me! This could be another negative externality that we judge to be acceptable, and that we don't want to internalize. Something like "if you break any of these rules, (e.g. worker safety, corporate espionage, etc.) then you owe the affected parties compensation. But as long as you follow the rules, there is no consensus-recognized debt."

Comment by StrivingForLegibility on Optimization Markets · 2024-01-02T03:52:48.773Z · LW · GW

It seems straightforward! Kaggle is the closest example I've been able to think of. But yes that's totally the sort of thing that I think would constitute an optimization market!

Comment by StrivingForLegibility on How Emergency Medicine Solves the Alignment Problem · 2023-12-26T16:53:32.338Z · LW · GW

Absolutely! I have less experience on the "figuring out what interventions are appropriate" side of the medical system, but I know of several safety measures they employ that we can adapt for AI safety.

For example, no actor is unilaterally permitted to think up a novel intervention and start implementing it. They need to convince an institutional review board that the intervention has merit, and that a clinical trial can be performed safely and ethically. Then the intervention needs to be approved by a bunch of bureaucracies like the FDA. And then medical directors can start incorporating that intervention into their protocols.

The AI design paradigm that I'm currently most in favor of, and that I think is compatible with the EMS Agenda 2050, is Drexler's Comprehensive AI Services (CAIS), where a bunch of narrow AI systems are safely employed to do specific, bounded tasks. A superintelligent system might come up with amazing novel interventions, and collaborate with humans and other superintelligent systems to design a clinical trial for testing them. Every party along the path from invention to deployment can benefit from AI systems helping them perform their roles more safely and effectively.

Comment by StrivingForLegibility on A Decision Theory Can Be Rational or Computable, but Not Both · 2023-12-22T05:41:15.061Z · LW · GW

This is a much more nuanced take! At the beginning of Chapter 6, Jan proposes restricting our attention to agents which are limit computable:

Our agents are useless if they cannot be approximated in practice, i.e., by a regular Turing machine. Therefore we posit that any ideal for a ‘perfect agent’ needs to be limit computable ().

This seems like a very reasonable restriction! Any implementation needs to be computable, but it makes sense to look for theoretic ideals which can be approximated.

Comment by StrivingForLegibility on A Decision Theory Can Be Rational or Computable, but Not Both · 2023-12-22T04:47:59.917Z · LW · GW

Yes! I'm a fan of Yudkowsky's view that the sensation of free will is the sensation of "couldness" among multiple actions. When it feels like I could do one thing or another, it feels like I have free will. When it feels like I could have chosen differently, it feels like I chose freely.

I suspect that an important ingredient of the One True Decision Theory is being shaped in such a way that other agents, modelling how you'll respond to different policies they might implement, find it in their interest to implement policies which treat you fairly.

Comment by StrivingForLegibility on Game Theory without Argmax [Part 2] · 2023-11-25T05:16:46.697Z · LW · GW

Got it, I misunderstood the semantics of what  was supposed to capture. I thought the elements needed to be mutual best-responses. Thank you for the clarification, I've updated my implementation accordingly!

Comment by StrivingForLegibility on Game Theory without Argmax [Part 2] · 2023-11-24T02:59:15.204Z · LW · GW

Edit: Cleo Nardo has confirmed that they intended  to mean the cartesian product of sets, the ordinary thing for that symbol to mean in that context. I misunderstood the semantics of what  was intended to represent. I've updated my implementation to use the intended cartesian product when calculating the best response function, the rest of this comment is based on my initial (wrong) interpretation of .

 

I'll write out the original expression, and your expression rewritten using the OP's notation:

Original:  

Yours: 

(I'm using the notation that a function applied to a set is the image of that set.)

This is a totally clear and valid rewriting using that notation! My background is in programming and I spent a couple minutes trying to figure out how mathematicians write "apply this function to this set."

I believe the way that  is being used is to find Nash equilibria, using Cleo's definition 6.5: 

Like before, the -nash equilibria of  is the set of option-profiles  such that .

These are going to be option-profiles where "not deviating" is considered optimal by every player simultaneously. I agree with your conclusion that this leads  to take on values that are either  or . When , this indicates that  is not a Nash equilibrium. When , we know that  is a Nash equilibrium.

Comment by StrivingForLegibility on Game Theory without Argmax [Part 2] · 2023-11-23T07:49:22.581Z · LW · GW

Edit: Cleo Nardo has confirmed that they intended  to mean the cartesian product of sets, the ordinary thing for that symbol to mean in that context. I misunderstood the semantics of what  was intended to represent. I've updated my implementation to use the intended cartesian product when calculating the best response function, the rest of this comment is my initial (wrong) interpretation of .

 

I needed to go back to one of the papers cited in Part 1 to understand what that  was doing in that expression. I found the answer in A Generalization of Nash's Theorem with Higher-Order Functionals. I'm going to do my best to paraphrase Hedges' notation into Cleo's notation, to avoid confusion.

 

TLDR:  is picking out the set of option-profiles  that are simultaneously best-responses by all players to that option-profile . It does this by considering all of the option-profiles that can result by each player best-responding, then takes the intersection of those sets.
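
In code (a made-up two-player payoff matrix, following the intersection-of-deviation-sets reading described above), the computation looks like this:

```typescript
// Two players, two options each; a profile is a pair of option indices.
type Profile = [number, number];

const options = [0, 1];

// Made-up payoffs: payoffs[player][i][j] at profile (i, j) (a Prisoners'-Dilemma-like game).
const payoffs: number[][][] = [
  [[3, 0], [5, 1]], // player 0
  [[3, 5], [0, 1]], // player 1
];

// Player p's optimal unilateral deviations from profile w, returned as full profiles.
function bestDeviations(p: number, w: Profile): Profile[] {
  const candidates = options.map((o): Profile => (p === 0 ? [o, w[1]] : [w[0], o]));
  const best = Math.max(...candidates.map(([i, j]) => payoffs[p][i][j]));
  return candidates.filter(([i, j]) => payoffs[p][i][j] === best);
}

// Intersect across players: profiles that every player simultaneously regards
// as a best response to w.
function mutualBestResponses(w: Profile): Profile[] {
  const [s0, s1] = [bestDeviations(0, w), bestDeviations(1, w)];
  return s0.filter(([i, j]) => s1.some(([x, y]) => x === i && y === j));
}

// w is a Nash equilibrium exactly when "not deviating" survives the intersection.
const isNash = (w: Profile): boolean =>
  mutualBestResponses(w).some(([i, j]) => i === w[0] && j === w[1]);

for (const i of options) {
  for (const j of options) {
    console.log(`(${i}, ${j})`, isNash([i, j]) ? "Nash" : "not Nash");
  }
}
```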

 

On page 6, Hedges defines the best response correspondence 

 

Where

 

Hedges builds up the idea of Nash Equilibria using quantifiers rather than optimizers (like  rather than ), but I believe the approaches are equivalent. Unpacking  from the inside out: 

 

That makes ) a -task. Since , we know that .

 

This is where I had to go looking through papers. What sort of product takes a set of best-responses from each player, relative to a given option-profile, and returns a set of option-profiles that are simultaneously regarded by each player as a best-response? I thought about just taking the Cartesian product of the sets, but that wouldn't get us only the mutual best-responses.

 

Let's call the way that each player maps option-profiles to best-responses . These are exactly the sets we want to take the product of:

 

Hedges introduces notation on page 3 to handle the operation of taking an option-profile, varying one player's option, and leaving the rest the same. Paraphrasing, Hedges defines  by 

You can read  as "give me a new copy of , where the  th entry has been set to the value ." Hedges uses this to define the deviation maps equivalently to the way Cleo did.  

 

The correspondences  take as input an option profile, and return the set of option-profiles which are player 's optimal unilateral deviations from that option profile. To construct  from , we want to map  to the option-profiles which deviate from  in those exact ways. 

 

We can then use Hedges'  to get the best-response correspondence! We can unpack this to get a definition of  using objects that Cleo defined, using that deviation notation from Hedges: 

 

Thank you Cleo for writing this article! This was my first introduction to Higher-Order Game Theory, and I wrote up an implementation in TypeScript to help me understand how all of the pieces fit together!