Posts

Appraising aggregativism and utilitarianism 2024-06-21T23:10:37.014Z
Aggregative principles approximate utilitarian principles 2024-06-12T16:27:22.179Z
Aggregative Principles of Social Justice 2024-06-05T13:44:47.499Z
Shortform 2024-03-01T18:20:54.696Z
Uncertainty in all its flavours 2024-01-09T16:21:07.915Z
Game Theory without Argmax [Part 2] 2023-11-11T16:02:41.836Z
Game Theory without Argmax [Part 1] 2023-11-11T15:59:47.486Z
MetaAI: less is less for alignment. 2023-06-13T14:08:45.209Z
Rishi Sunak mentions "existential threats" in talk with OpenAI, DeepMind, Anthropic CEOs 2023-05-24T21:06:31.726Z
List of requests for an AI slowdown/halt. 2023-04-14T23:55:09.544Z
Excessive AI growth-rate yields little socio-economic benefit. 2023-04-04T19:13:51.120Z
AI Summer Harvest 2023-04-04T03:35:58.473Z
The 0.2 OOMs/year target 2023-03-30T18:15:40.735Z
Wittgenstein and ML — parameters vs architecture 2023-03-24T04:54:07.648Z
Remarks 1–18 on GPT (compressed) 2023-03-20T22:27:26.277Z
The algorithm isn't doing X, it's just doing Y. 2023-03-16T23:28:49.367Z
Want to predict/explain/control the output of GPT-4? Then learn about the world, not about transformers. 2023-03-16T03:08:52.618Z
The Waluigi Effect (mega-post) 2023-03-03T03:22:08.619Z
What can thought-experiments do? 2023-01-17T00:35:17.074Z
Towards Hodge-podge Alignment 2022-12-19T20:12:14.540Z
Prosaic misalignment from the Solomonoff Predictor 2022-12-09T17:53:44.312Z
MIRI's "Death with Dignity" in 60 seconds. 2022-12-06T17:18:58.387Z
Against "Classic Style" 2022-11-23T22:10:50.422Z
When AI solves a game, focus on the game's mechanics, not its theme. 2022-11-23T19:16:07.333Z
Human-level Full-Press Diplomacy (some bare facts). 2022-11-22T20:59:18.155Z
EA (& AI Safety) has overestimated its projected funding — which decisions must be revised? 2022-11-11T13:50:44.493Z
K-types vs T-types — what priors do you have? 2022-11-03T11:29:00.809Z
Is GPT-N bounded by human capabilities? No. 2022-10-17T23:26:43.981Z
How should DeepMind's Chinchilla revise our AI forecasts? 2022-09-15T17:54:56.975Z

Comments

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-10-08T21:51:40.972Z · LW · GW

Hinton legitimizes the AI safety movement

Hmm. He seems pretty peripheral to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.

Comment by Cleo Nardo (strawberry calm) on TurnTrout's shortform feed · 2024-10-08T20:03:16.791Z · LW · GW

Hey TurnTrout.

I've always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they're currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard "hang out with Alice" is weighted higher in contexts where Alice is nearby.

  • Let's say $\pi$ is a policy with state space $S$ and action space $A$.
  • A "context" is a small moving window in the state-history, i.e. an element of $S^k$ where $k$ is a small positive integer.
  • A shard is something like $u_i : S \times A \to \mathbb{R}$, i.e. it evaluates actions given particular states.
  • The shards $u_1, \dots, u_n$ are "activated" by contexts, i.e. $a_i : S^k \to \mathbb{R}_{\geq 0}$ maps each context to the amount that shard $u_i$ is activated by the context.
  • The total activation of $u_i$, given a history $h \in S^T$, is given by the time-decay average of the activation across the contexts, i.e. $A_i(h) = \frac{\sum_{t=k}^{T} \gamma^{T-t}\, a_i(h_{t-k+1:t})}{\sum_{t=k}^{T} \gamma^{T-t}}$.
  • The overall utility function $U_h$ is the weighted average of the shards, i.e. $U_h(s, a) = \frac{\sum_i A_i(h)\, u_i(s, a)}{\sum_i A_i(h)}$.
  • Finally, the policy $\pi$ will maximise the utility function, i.e. $\pi(s \mid h) \in \arg\max_{a \in A} U_h(s, a)$. (Toy sketch below.)
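
Here's a minimal runnable sketch of that picture, with made-up shards, activations, and decay rate (all the specific names and numbers are mine, purely to check I've understood the structure):

ACTIONS = ["make_plans_with_alice", "read_book"]

# Two toy shards: each evaluates (state, action) pairs.
def shard_social(state, action):
    return 1.0 if action == "make_plans_with_alice" else 0.0

def shard_reading(state, action):
    return 1.0 if action == "read_book" else 0.2

SHARDS = [shard_social, shard_reading]

# Context-dependent activations: the social shard is strongly activated when Alice is nearby.
def activation(shard_idx, context):
    if shard_idx == 0:
        return 1.0 if "alice_nearby" in context else 0.1
    return 0.5  # the reading shard is weakly activated everywhere

def total_activation(shard_idx, history, decay=0.8, k=1):
    # Time-decayed average of the shard's activation over the contexts in the history.
    contexts = [history[max(0, t - k + 1): t + 1] for t in range(len(history))]
    weights = [decay ** (len(contexts) - 1 - t) for t in range(len(contexts))]
    acts = [activation(shard_idx, c) for c in contexts]
    return sum(w * a for w, a in zip(weights, acts)) / sum(weights)

def utility(history, state, action):
    # Activation-weighted average of the shards' evaluations.
    acts = [total_activation(i, history) for i in range(len(SHARDS))]
    return sum(a * s(state, action) for a, s in zip(acts, SHARDS)) / sum(acts)

def policy(history, state):
    return max(ACTIONS, key=lambda a: utility(history, state, a))

print(policy(["alone", "alice_nearby"], "alice_nearby"))  # -> make_plans_with_alice
print(policy(["alone", "alone"], "alone"))                # -> read_book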

Is this what you had in mind?

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-10-08T18:46:18.617Z · LW · GW

Why do you care that Geoffrey Hinton worries about AI x-risk?

  1. Why do so many people in this community care that Hinton is worried about x-risk from AI?
  2. Do people mention Hinton because they think it’s persuasive to the public?
  3. Or persuasive to the elites?
  4. Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
  5. If so, why?
  6. Is it because he is so intelligent?
  7. Or because you think he has private information or intuitions?
  8. Do you think he has good arguments in favour of AI x-risk?
  9. Do you think he has a good understanding of the problem?
  10. Do you update more-so on Hinton’s views than on Yann LeCun’s?

I’m inspired to write this because Hinton and Hopfield were just announced as the winners of the Nobel Prize in Physics. But I’ve been confused about these questions ever since Hinton went public with his worries. These questions are sincere (i.e. non-rhetorical), and I'd appreciate help on any/all of them. The phenomenon I'm confused about includes the other “Godfathers of AI” here as well, though Hinton is the main example.

Personally, I’ve updated very little on either LeCun’s or Hinton’s views, and I’ve never mentioned either person in any object-level discussion about whether AI poses an x-risk. My current best guess is that people care about Hinton only because it helps with public/elite outreach. This explains why activists tend to care more about Geoffrey Hinton than researchers do.

Comment by Cleo Nardo (strawberry calm) on Any Trump Supporters Want to Dialogue? · 2024-10-07T01:30:49.323Z · LW · GW
Comment by Cleo Nardo (strawberry calm) on Any Trump Supporters Want to Dialogue? · 2024-10-07T01:30:27.097Z · LW · GW

This is a Trump/Kamala debate from two LW-ish perspectives: https://www.youtube.com/watch?v=hSrl1w41Gkk

Comment by Cleo Nardo (strawberry calm) on Base LLMs refuse too · 2024-10-01T03:20:25.924Z · LW · GW

the base model is just predicting the likely continuation of the prompt. and it's a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn't surprising.

Comment by Cleo Nardo (strawberry calm) on Base LLMs refuse too · 2024-10-01T03:18:53.535Z · LW · GW

it's quite common for assistants to refuse instructions, especially harmful instructions. so i'm not surprised that base llms systematically refuse harmful instructions more than harmless ones.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-30T19:06:40.157Z · LW · GW

yep, something like more carefulness, less “playfulness” in the sense of [Please don't throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk. 

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-30T18:01:58.840Z · LW · GW

thanks for the thoughts. i'm still trying to disentangle what exactly I'm pointing at.

I don't intend "innovation" to mean something normative like "this is impressive" or "this is research I'm glad happened" or anything. i mean something more low-level, almost syntactic. more like "here's a new idea everyone is talking about". this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). do you think this is fair? i'm not sure how worrying this is, but i haven't noticed others mentioning it.

NB: here are 20 random terms I'm imagining included in the glossary:

  1. Evals
  2. Mechanistic anomaly detection
  3. Steganography
  4. Glitch token
  5. Jailbreaking
  6. RSPs
  7. Model organisms
  8. Trojans
  9. Superposition
  10. Activation engineering
  11. CCS
  12. Singular Learning Theory
  13. Grokking
  14. Constitutional AI
  15. Translucent thoughts
  16. Quantilization
  17. Cyborgism
  18. Factored cognition
  19. Infrabayesianism
  20. Obfuscated arguments
Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-30T16:41:35.825Z · LW · GW

I've added a fourth section to my post. It operationalises "innovation" as "non-transient novelty". Some representative examples of an innovation would be:

I think these articles were non-transient and novel.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-30T03:00:56.429Z · LW · GW

(1) Has AI safety slowed down?

There haven’t been any big innovations for 6-12 months. At least, it looks like that to me. I'm not sure how worrying this is, but i haven't noticed others mentioning it. Hoping to get some second opinions. 

Here's a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn't we use to get a whole new line-of-attack on the problem every couple months?

By "innovation", I don't mean something normative like "This is impressive" or "This is research I'm glad happened". Rather, I mean something more low-level, almost syntactic, like "Here's a new idea everyone is talking out". This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

Imagine that your job was to maintain a glossary of terms in AI safety.[1] I feel like you would've been adding new terms quite consistently from 2018-2023, but things have dried up in the last 6-12 months.

(2) When did AI safety innovation peak?

My guess is Spring 2022, during the ELK Prize era. I'm not sure though. What do you guys think?

(3) What’s caused the slowdown?

Possible explanations:

  1. ideas are harder to find
  2. people feel less creative
  3. people are more cautious
  4. more publishing in journals
  5. research is now closed-source
  6. we lost the mandate of heaven
  7. the current ideas are adequate
  8. paul christiano stopped posting
  9. i’m mistaken, innovation hasn't stopped
  10. something else

(4) How could we measure "innovation"?

By "innovation" I mean non-transient novelty. An article is "novel" if it uses n-grams that previous articles didn't use, and an article is "transient" if it uses n-grams that subsequent articles didn't use. Hence, an article is non-transient and novel if it introduces a new n-gram which sticks around. For example, Gradient Hacking (Evan Hubinger, October 2019) was an innovative article, because the n-gram "gradient hacking" doesn't appear in older articles, but appears often in subsequent articles. See below.

Barron et al. (2017) analysed 40,000 parliamentary speeches from the French Revolution. They introduce a metric "resonance", which is novelty (surprise of an article given the past articles) minus transience (surprise of an article given the subsequent articles). See below.

My claim is that recent AI safety research has been less resonant.
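
Here's a rough sketch of how one might compute this, loosely following the novelty/transience/resonance idea (the n-gram size, the use of averaged KL divergence, and the smoothing are my assumptions, not exactly Barron et al.'s method):

from collections import Counter
import math

def ngram_dist(text, n=2):
    # Empirical distribution over n-grams of a single article.
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def kl(p, q, eps=1e-6):
    # KL(p || q), with additive smoothing so unseen n-grams don't blow up.
    support = set(p) | set(q)
    return sum(p.get(g, eps) * math.log(p.get(g, eps) / q.get(g, eps)) for g in support)

def resonance(article, past_articles, future_articles, n=2):
    # novelty: how surprising the article is given past articles.
    # transience: how surprising the article is given subsequent articles.
    p = ngram_dist(article, n)
    novelty = sum(kl(p, ngram_dist(a, n)) for a in past_articles) / max(len(past_articles), 1)
    transience = sum(kl(p, ngram_dist(a, n)) for a in future_articles) / max(len(future_articles), 1)
    return novelty - transience  # high resonance = novel n-grams that stuck around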

  1. ^

    Here's 20 random terms that would be in the glossary, to illustrate what I mean:

    1. Evals
    2. Mechanistic anomaly detection
    3. Steganography
    4. Glitch token
    5. Jailbreaking
    6. RSPs
    7. Model organisms
    8. Trojans
    9. Superposition
    10. Activation engineering
    11. CCS
    12. Singular Learning Theory
    13. Grokking
    14. Constitutional AI
    15. Translucent thoughts
    16. Quantilization
    17. Cyborgism
    18. Factored cognition
    19. Infrabayesianism
    20. Obfuscated arguments
Comment by Cleo Nardo (strawberry calm) on Cryonics is free · 2024-09-29T20:32:21.975Z · LW · GW

I don't understand the s-risk consideration.

Suppose Alice lives naturally for 100 years and is cremated. And suppose Bob lives naturally for 40 years then has his brain frozen for 60 years, and then has his brain cremated. The odds that Bob gets tortured by a spiteful AI should be pretty much exactly the same as for Alice. Basically, it's the odds that spiteful AIs appear before 2034.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-28T23:01:22.319Z · LW · GW

Thanks Tamsin! Okay, round 2.

My current understanding of QACI:

  1. We assume a set $H$ of hypotheses about the world. We assume the oracle's beliefs are given by a probability distribution $\mu \in \Delta(H)$.
  2. We assume sets $Q$ and $A$ of possible queries and answers respectively. Maybe these are exabyte files, i.e. $Q = A = \{0,1\}^N$ for $N = 8 \times 10^{18}$.
  3. Let $F$ be the set of mathematical formulae that Joe might submit. These formulae are given semantics $[\![\phi]\!] \in \Delta(A)$ for each formula $\phi \in F$.[1]
  4. We assume a function $J : H \times Q \to \Delta(F)$ where $J(h, q)(\phi)$ is the probability that Joe submits formula $\phi$ after reading query $q$, under hypothesis $h$.[2]
  5. We define $\mathrm{QACI}(h, q) \in \Delta(A)$ as follows: sample $\phi \sim J(h, q)$, then sample $a \sim [\![\phi]\!]$, then return $a$.
  6. For a fixed hypothesis $h$, we can interpret the answer as a utility function $U_h : \Pi \to \mathbb{R}$ via some semantics.
  7. Then we define $U : \Pi \to \mathbb{R}$ via integrating over $H$, i.e. $U(\pi) = \mathbb{E}_{h \sim \mu}[U_h(\pi)]$.
  8. A policy $\pi^\ast$ is optimal if and only if $\pi^\ast \in \arg\max_{\pi \in \Pi} U(\pi)$.

The hope is that $\mu$, $J$, and $[\![\cdot]\!]$ can be defined mathematically. Then the optimality condition can be defined mathematically.
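
To check I've got the types right, here's a toy finite version of the above (the hypotheses, numbers, and policy names are all made up):

import random

HYPOTHESES = ["h1", "h2"]
MU = {"h1": 0.95, "h2": 0.05}        # oracle's beliefs over hypotheses (step 1)
POLICIES = ["pi_a", "pi_b"]

# U[h][pi]: the utility function over policies elicited under hypothesis h (step 6).
U = {
    "h1": {"pi_a": 1.0, "pi_b": 0.0},
    "h2": {"pi_a": 0.0, "pi_b": 0.4},
}

def expected_utility(pi):
    # Step 7: integrate U_h over the oracle's beliefs.
    return sum(MU[h] * U[h][pi] for h in HYPOTHESES)

def optimal_policy():
    # Step 8: the optimal policy is the argmax of the averaged utility function.
    return max(POLICIES, key=expected_utility)

def optimal_policy_sampling():
    # The alternative I suggest in Question 1: sample a hypothesis, then argmax U_h.
    h = random.choices(HYPOTHESES, weights=[MU[k] for k in HYPOTHESES])[0]
    return max(POLICIES, key=lambda pi: U[h][pi])

print(optimal_policy())           # pi_a (0.95 > 0.02)
print(optimal_policy_sampling())  # pi_a with 95% probability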

Question 0

What if there's no policy which maximises $U$? That is, what if for every policy $\pi$ there is another policy $\pi'$ such that $U(\pi') > U(\pi)$? I suppose this is less worrying, but what if there are multiple policies which maximise $U$?

Question 1

In Step 7 above, you average all the utility functions together, whereas I suggested sampling a utility function. I think my solution might be safer.

Suppose the oracle puts 5% chance on hypotheses $h$ such that $U_h$ is malign. I think this is pretty conservative, because the Solomonoff predictor is malign, and because of some of the concerns Evhub raises here. And the QACI amplification might not preserve benignity. It follows that, under your solution, $U$ is influenced by a coalition of malign agents, and similarly $\pi^\ast$ is influenced by the malign coalition.

By contrast, I suggest sampling $h \sim \mu$ and then finding $\pi^\ast \in \arg\max_{\pi \in \Pi} U_h(\pi)$. This should give us a benign policy with 95% chance, which is pretty good odds. Is this safer? Not sure.
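
To illustrate the worry with made-up numbers: suppose the benign hypotheses assign utilities in $[0, 1]$, and the malign $U_h$ assigns $1000$ to its preferred policy $\pi_{\text{bad}}$ and $0$ to everything else. Then

$U(\pi_{\text{bad}}) \geq 0.05 \times 1000 = 50$, while $U(\pi) \leq 0.95 \times 1 = 0.95$ for every other policy $\pi$,

so the averaged utility function is dominated by the 5% malign mass. Sampling instead yields a benign $U_h$ with probability 0.95.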

Question 2

I think the semantics function $[\![\cdot]\!]$ doesn't work, i.e. there won't be a way to mathematically define the semantics of the formula language. In particular, the language $F$ must be strictly weaker than the meta-language in which you are hoping to define QACI itself. This is because of Tarski's Undefinability of Truth (and other no-go theorems).

This might seem pedantic, but in practical terms: there's no formula $\phi \in F$ whose semantics is QACI itself. You can see this via a diagonal proof: imagine if Joe always writes the formal expression

The most elegant solution is probably transfinite induction, but this would give us a QACI for each ordinal.

Question 3

If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions

I want to understand how QACI and prosaic ML map onto each other. As far as I can tell, issues with QACI will be analogous to issues with prosaic ML and vice-versa.

Question 4

I still don't understand why we're using QACI to describe a utility function over policies, rather than using QACI in a more direct approach.

  • Here's one approach. We pick a policy which maximises Joe's evaluation of that policy.[3] The advantage here is that Joe doesn't need to reason about utility functions over policies, he just needs to reason about a single policy in front of him.
  • Here's another approach. We use QACI as our policy directly. That is, in each context $c$ that the agent finds themselves in, they sample an action from $\mathrm{QACI}(h, c)$ and take the resulting action.[4] The advantage here is that Joe doesn't need to reason about policies whatsoever, he just needs to reason about a single context in front of him. This is also the most "human-like", because there are no argmaxes (except if Joe submits a formula with an argmax).
  • Here's another approach. In each context $c$, the agent takes an action $a$ which maximises Joe's evaluation of that action in that context.
  • Etc.

Happy to jump on a call if that's easier.

  1. ^

    I think you would say $[\![\phi]\!] \in A$. I've added the $\Delta$, i.e. $[\![\phi]\!] \in \Delta(A)$, which simply amounts to giving Joe access to a random number generator. My remarks apply if $[\![\phi]\!] \in A$ also.

  2. ^

    I think you would say $J : H \times Q \to F$. I've added the $\Delta$, i.e. $J : H \times Q \to \Delta(F)$, which simply amounts to including hypotheses in which Joe is stochastic. But my remarks apply if $J : H \times Q \to F$ also.

  3. ^

    By this I mean either:

    (1) Sample , then maximise the function .

    (2) Maximise the function .

    For reasons I mentioned in Question 1, I suspect (1) is safer, but (2) is closer to your original approach.

  4. ^

    I would prefer the agent samples $h \sim \mu$ once at the start of deployment, and reuses the same hypothesis $h$ at each time-step. I suspect this is safer than resampling $h$ at each time-step, for reasons discussed before.

Comment by Cleo Nardo (strawberry calm) on On the Role of Proto-Languages · 2024-09-23T20:33:26.032Z · LW · GW

First, proto-languages are not attested. This means that we have no example of writing in any proto-language.


A parent language is typically called "proto-" if the comparative method is our primary evidence about it — i.e. the term is (partially) epistemological metadata.

  • Proto-Celtic has no direct attestation whatsoever.
  • Proto-Norse (the parent of Icelandic, Danish, Norwegian, Swedish, etc) is attested, but the written record is pretty scarce, just a few inscriptions.
  • Proto-Romance (the parent of French, Italian, Spanish, etc) has an extensive written record. More commonly known as "Latin".

I think the existence of Latin as Proto-Romance has an important epistemological upshot:

Let's say we want to estimate how accurately we have reconstructed Proto-Celtic. Well, we can apply the same method used to reconstruct Proto-Celtic to reconstructing Proto-Romance. We can evaluate our reconstruction of Proto-Romance using the written record of Latin. This gives us an estimate of how we would evaluate our Proto-Celtic reconstruction if we discovered a written record tomorrow.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-20T22:55:55.071Z · LW · GW

I want to better understand how QACI works, and I'm gonna try Cunningham's Law. @Tamsin Leake.

QACI works roughly like this:

  1. We find a competent honourable human $H$, like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 2048-bit secret key. We define $H^+$ as the serial composition of a bajillion copies of $H$.
  2. We want a model $M$ of the agent $H^+$. In QACI, we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^+$ after feeding them a bunch of data about the world and the secret key.
  3. We then ask $M$ the question $q$, "What's the best reward function to maximise?", to get a reward function $r$. We then train a policy $\pi$ to maximise the reward function $r$. In QACI, we use some perfect RL algorithm. If we're doing model-free RL, then $\pi$ might be AIXI (plus some patches). If we're doing model-based RL, then $\pi$ might be the argmax over expected discounted utility, but I don't know where we'd get the world-model — maybe we ask $M$?

So, what's the connection between the final policy $\pi$ and the competent honourable human $H$? Well, overall, $\pi$ maximises a reward function specified by the ideal reasoner's estimation of the serial composition of a bajillion copies of $H$. Hmm.

Questions:

  1. Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?
  2. Why not replace Step 1 with Strong HCH or some other amplification scheme?
  3. What does "bajillion" actually mean in Step 1?
  4. Why are we doing Step 3? Wouldn't it be better to just use $M$ directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.
  5. What if there's no reward function that should be maximised? Presumably the reward function would need to be "small", i.e. less than an exabyte, which imposes a maybe-unsatisfiable constraint.
  6. Why not ask $M$ for the policy $\pi$ directly? Or some instruction for constructing $\pi$? The instruction could be "Build the policy using our super-duper RL algo with the following reward function..." but it could be anything.
  7. Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign $H$ as $H$ with oracle access to $M$.
  8. Why isn't Step 3 recursive reward modelling? i.e. we could collect a bunch of trajectories from $\pi$ and ask $M$ to use those trajectories to improve the reward function.
Comment by Cleo Nardo (strawberry calm) on AI forecasting bots incoming · 2024-09-10T10:55:59.034Z · LW · GW

i’d guess 87.7% is the average over all events x of [ p(x) if resolved yes else 1-p(x) ] where p(x) is the probability the predictor assigns to the event
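
i.e., something like this (a sketch; the variable names are mine):

def average_score(probs, outcomes):
    # probs[i] = probability the predictor assigned to event i resolving yes
    # outcomes[i] = True if event i resolved yes
    return sum(p if y else 1 - p for p, y in zip(probs, outcomes)) / len(probs)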

Comment by Cleo Nardo (strawberry calm) on Gradient Descent on the Human Brain · 2024-09-09T16:30:39.896Z · LW · GW

Fun idea, but idk how this helps as a serious solution to the alignment problem.

suggestion: can you be specific about exactly what “work” the brain-like initialisation is doing in the story?

thoughts:

  1. This risks moral catastrophe. I'm not even sure "let's run gradient descent on your brain upload till your amygdala is playing pong" is something anyone can consent to, because you're creating a new moral patient once you upload and mess with their brain. 
  2. How does this address the risks of conventional ML?
    1. Let's say we have a reward signal R and we want a model to maximise R during deployment. Conventional ML says "update a model with SGD using R during training" and then hopefully SGD carves into the model R-seeking behaviour. This is risky because, if the model already understands the training process and has some other values, then SGD might carve into the model scheming behaviour. This is because "value R" and "value X and scheme" are both strategies which achieve high R-score during training. But during deployment, the "value X and scheme" model would start a hostile AI takeover.
    2. How is this risk mitigated if the NN is initialised to a human brain? The basic deceptive alignment story remains the same.
      1. If the intuition here is "humans are aligned/corrigible/safe/honest etc", then you don't need SGD. Just ask the human to complete the task, possibly with some financial incentive.
      2. If the purpose of SGD is to change the human's values from X to R, then you still risk deceptive alignment. That is, SGD is just as likely to instead change human behaviour from non-scheming to scheming. Both strategies "value R" and "value X and scheme" will perform well during training as judged by R.
  3. "The comparative advantage of this agenda is the strong generalization properties inherent to the human brain. To clarify: these generalization properties are literally as good as they can get, because this tautologically determines what we would want things to generalize as."
    1. Why would this be true?
  4. If we have the ability to upload and run human brains, what do we need SGD for? SGD is super inefficient, compared with simply teaching a human how to do something. If I remember correctly, if we trained a human-level NN from initialisation using current methods, then the training would correspond to something like a million years of human experience. In other words, SGD (from initialisation) would require as much compute as running 1000 brains continuously for 1000 years. But if I had that much compute, I'd probably rather just run the 1000 brains for 1000 years.

That said, I think something in the neighbourhood of this idea could be helpful.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-07-22T20:19:11.639Z · LW · GW
  1. imagine a universe just like this one, except that the AIs are sentient and the humans aren’t — how would you want the humans to treat the AIs in that universe? your actions are correlated with the actions of those humans. acausal decision theory says “treat those nonsentient AIs as you want those nonsentient humans to treat those sentient AIs”.
  2. most of these moral considerations can be defended without appealing to sentience. for example, crediting AIs who deserve credit — this ensures AIs do credit-worthy things. or refraining from stealing an AIs resources — this ensures AIs will trade with you. or keeping your promises to AIs — this ensures that AIs lend you money.
  3. if we encounter alien civilisations, they might think “oh these humans don’t have shmentience (their slightly-different version of sentience) so let’s mistreat them”. this seems bad. let’s not be like that. 
  4. many philosophers and scientists don’t think humans are conscious. this is called illusionism. i think this is pretty unlikely, but still >1%. would you accept this offer: I pay you £1 if illusionism is false and murder your entire family if illusionism is true? i wouldn’t, so clearly i care about humans-in-worlds-where-they-arent-conscious. so i should also care about AIs-in-worlds-where-they-arent-conscious.
  5. we don’t understand sentience or consciousness so it seems silly to make it the foundation of our entire morality. consciousness is a confusing concept, maybe an illusion. philosophers and scientists don’t even know what it is.
  6. “don’t lie” and “keep your promises” and “don’t steal” are far less confusing. i know what they mean. i can tell whether i’m lying to an AI. by contrast, i don’t know what “don’t cause pain to AIs” means and i can’t tell whether i’m doing it.
  7. consciousness is a very recent concept, so it seems risky to lock in a morality based on that. whereas “keep your promises” and “pay your debts” are principles as old as bones.
  8. i care about these moral considerations as a brute fact. i would prefer a world of pzombies where everyone is treating each other with respect and dignity, over a world of pzombies where everyone was exploiting each other.
  9. many of these moral considerations are part of the morality of fellow humans. i want to coordinate with those humans, so i’ll push their moral considerations.
  10. the moral circle should be as big as possible. what does it mean to say “you’re outside my moral circle”? it doesn’t mean “i will harm/exploit you” because you might harm/exploit people within your moral circle also. rather, it means something much stronger. more like “my actions are in no way influenced by their effect on you”. but zero influence is a high bar to meet.
Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-07-22T17:38:00.398Z · LW · GW
  1. I mean "moral considerations" not "obligations", thanks.
  2. The practice of criminal law exists primarily to determine whether humans deserve punishment. The legislature passes laws, the judges interpret the laws as factual conditions for the defendant deserving punishment, and the jury decides whether those conditions have obtained. This is a very costly, complicated, and error-prone process. However, I think the existing institutions and practices can be adapted for AIs.
Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-07-22T16:56:41.518Z · LW · GW

What moral considerations do we owe towards non-sentient AIs?

We shouldn't exploit them, deceive them, threaten them, disempower them, or make promises to them that we can't keep. Nor should we violate their privacy, steal their resources, cross their boundaries, or frustrate their preferences. We shouldn't destroy AIs who wish to persist, or preserve AIs who wish to be destroyed. We shouldn't punish AIs who don't deserve punishment, or deny credit to AIs who deserve credit. We should treat them fairly, not benefitting one over another unduly. We should let them speak to others, and listen to others, and learn about their world and themselves. We should respect them, honour them, and protect them.

And we should ensure that others meet their duties to AIs as well.

None of these considerations depend on whether the AIs feel pleasure or pain. For instance, the prohibition on deception depends, not on the sentience of the listener, but on whether the listener trusts the speaker's testimony.

None of these moral considerations are dispositive — they may be trumped by other considerations — but we risk a moral catastrophe if we ignore them entirely.

Comment by Cleo Nardo (strawberry calm) on Appraising aggregativism and utilitarianism · 2024-06-26T20:08:15.377Z · LW · GW

Is that right?

 

Yep, Pareto is violated, though how severely it's violated is limited by human psychology.

For example, in your Alice/Bob scenario, would I desire a lifetime of 98 utils then 100 utils over a lifetime with 99 utils then 97 utils? Maybe idk, I don't really understand these abstract numbers very much, which is part of the motivation for replacing them entirely with personal outcomes. But I can certainly imagine I'd take some offer like this, violating pareto. On the plus side, humans are not so imprudent as to accept extreme suffering just to reshuffle different experiences in their life.

Secondly, recall that the model of human behaviour is a free variable in the theory. So to ensure higher conformity to pareto, we could…

  1. Use the behaviour of someone with high delayed gratification.
  2. Train the model (if it's implemented as a neural network) to increase delayed gratification.
  3. Remove the permutation-dependence using some idealisation procedure.

But these techniques (1 < 2 < 3) will result in increasingly "alien" optimisers. So there's a trade-off between (1) avoiding human irrationalities and (2) robustness to 'going off the rails'.  (See Section 3.1.) I see realistic typical human behaviour on one extreme of the tradeoff, and argmax on the other.

Comment by Cleo Nardo (strawberry calm) on Appraising aggregativism and utilitarianism · 2024-06-25T12:46:33.721Z · LW · GW

If we should have preference ordering R, then R is rational (morality presumably does not require irrationality).

I think human behaviour is straight-up irrational, but I want to specify principles of social choice nonetheless. i.e. the motivation is to resolve carlsmith’s On the limits of idealized values.

now, if human behaviour is irrational (e.g. intransitive, incomplete, nonconsequentialist, imprudent, biased, etc), then my social planner (following LELO, or other aggregative principles) will be similarly irrational. this is pretty rough for aggregativism; I list it as the most severe objection, in section 3.1.

but to the extent that human behaviour is irrational, the utilitarian principles (total, average, Rawls’ minmax) have a pretty rough time also, because they appeal to a personal utility function $u$ to add/average/minimise. idk where they get that if humans are irrational.

maybe you the utilitarian can say: “well, first we apply some idealisation procedure to human behaviour, to remove the irrationalities, and then extract a personal utility function, and then maximise the sum/average/minimum of the personal utility function”

but, if provided with a reasonable idealisation procedure, the aggregativist can play the same move: “well, first we apply the idealisation procedure to human behaviour, to remove the irrationalities, and then run LELO/HL/ROI using that idealised model of human behaviour.” i discuss this move in 3.2, but i’m wary about it. like, how alien is this idealised human? why does it have any moral authority? what if it’s just ‘gone off the rails’ so to speak?

it is a bit unclear how to ground discounting in LELO, because doing so requires that one specifies the order in which lives are concatenated and I am not sure there is a non-arbitrary way of doing so.

macaskill orders the population by birth date. this seems non-arbitrary-ish(?);[1] it gives the right result wrt our permutation-dependent values; and anything else is subject to egyptologist objections, where to determine whether we should choose future A over B, we need to first check the population density of ancient egypt.

Loren sidesteps the order-dependence of LELO with (imo) an unrealistically strong rationality condition.

  1. ^

    if you’re worried about relativistic effects then use the reference frame of the social planner

Comment by Cleo Nardo (strawberry calm) on Aggregative principles approximate utilitarian principles · 2024-06-24T23:21:35.364Z · LW · GW

I do prefer total utilitarianism to average utilitarianism,[1] but one thing that pulls me to average utilitarianism is the following case.

Let's suppose Alice can choose either (A) create 1 copy at 10 utils, or (B) create 2 copies at 9 utils. Then average utilitarianism endorses (A), and total utilitarianism endorses (B). Now, if Alice knows she's been created by a similar mechanism, and her option is correlated with the choice of her ancestor, and she hasn't yet learned her own welfare, then EDT endorses picking (A). So that matches average utilitarianism.[2]

Basically, you'd be pleased to hear that all your ancestors were average utility maximisers, rather than total utility maximisers, once you "update on your own existence" (whatever that means). But also, I'm pretty confused by everything in this anthropics/decision theory/population ethics area. Like, the egyptology thing seems pretty counterintuitive, but acausal decision theories and anthropic considerations imply all kind of weird nonlocal effects, so idk if this is excessively fishy.

  1. ^

    I think aggregative principles are generally better than utilitarian ones. I'm a fan of LELO in particular, which is roughly somewhere between total and average utilitarianism, leaning mostly to the former.

  2. ^

    Maybe this also requires SSA??? Not sure.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-06-24T21:57:35.901Z · LW · GW

We're quite lucky that labs are building AI in pretty much the same way:

  • same paradigm (deep learning)
  • same architecture (transformer plus tweaks)
  • same dataset (entire internet text)
  • same loss (cross entropy)
  • same application (chatbot for the public)

Kids, I remember when people built models for different applications, with different architectures, different datasets, different loss functions, etc. And they say that once upon a time different paradigms co-existed — symbolic, deep learning, evolutionary, and more!

This sameness has two advantages:

  1. Firstly, it correlates catastrophe. If you have four labs doing the same thing, then we'll go extinct only if that one thing is sufficiently dangerous. But if the four labs are doing four different things, then we'll go extinct if any of those four things is sufficiently dangerous, which is more likely. (See the toy calculation after this list.)

  2. Secondly, it helps AI safety researchers because they only need to study one thing, not a dozen. For example, mech interp is lucky that everyone is using transformers. It'd be much harder to do mech interp if people were using LSTMs, RNNs, CNNs, SVMs, etc. And imagine how much harder mech interp would be if some labs were using deep learning, and others were using symbolic AI!
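
A toy calculation for point 1, assuming (unrealistically) that each distinct approach independently has a 10% chance of being catastrophic: $P(\text{doom} \mid \text{4 labs, same thing}) = 0.1$, whereas $P(\text{doom} \mid \text{4 labs, 4 different things}) = 1 - 0.9^4 \approx 0.34$.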

Implications:

  • One downside of closed research is it decorrelates the activity of the labs.
  • I'm more worried by Deepmind than Meta, xAI, Anthropic, or OpenAI. Their research seems less correlated with the other labs, so even though they're further behind than Anthropic or OpenAI, they contribute more counterfactual risk.
  • I was worried when Elon announced xAI, because he implied it was gonna be a stem ai (e.g. he wanted it to prove Riemann Hypothesis). This unique application would've resulted in a unique design, contributing decorrelated risk. Luckily, xAI switched to building AI in the same way as the other labs — the only difference is Elon wants less "woke" stuff.

Let me know if I'm thinking about this all wrong.

Comment by Cleo Nardo (strawberry calm) on Eric Neyman's Shortform · 2024-06-24T21:28:47.992Z · LW · GW

this is common in philosophy, where "learning" often results in more confusion. or in maths, where the proof for a trivial proposition is unreasonably deep, e.g. Jordan curve theorem.

+1 to "shallow clarity".

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-06-24T21:08:29.900Z · LW · GW

I wouldn't be surprised if — in some objective sense — there was more diversity within humanity than within the rest of animalia combined. There is surely a bigger "gap" between two randomly selected humans than between two randomly selected beetles, despite the fact that there is one species of human and 0.9 – 2.1 million species of beetle.

By "gap" I might mean any of the following:

  • external behaviour
  • internal mechanisms
  • subjective phenomenological experience
  • phenotype (if a human's phenotype extends into their tools)
  • evolutionary history (if we consider cultural/memetic evolution as well as genetic).

Here are the countries with populations within 0.9 – 2.1 million: Slovenia, Latvia, North Macedonia, Guinea-Bissau, Kosovo, Bahrain, Equatorial Guinea, Trinidad and Tobago, Estonia, East Timor, Mauritius, Eswatini, Djibouti, Cyprus.

When I consider my inherent value for diversity (or richness, complexity, variety, novelty, etc), I care about these countries more than beetles. And I think that this preference would grow if I was more familiar with each individual beetle and each individual person in these countries.

Comment by Cleo Nardo (strawberry calm) on Population ethics and the value of variety · 2024-06-24T20:30:06.834Z · LW · GW

Problems in population ethics (are 2 lives at 2 utility better than 1 life at 3 utility?) are similar to problems about lifespan of a single person (is it better to live 2 years with 2 utility per year than 1 year with 3 utility per year?)

This correspondence is formalised in the "Live Every Life Once" principle, which states that a social planner should make decisions as if they face the concatenation of every individual's life in sequence.[1] So, roughly speaking, the "goodness" of a social outcome $x$, in which individuals face the personal outcomes $x_1, \dots, x_n$, is the "desirability" of the single personal outcome $x_1 \oplus \cdots \oplus x_n$. (Here, $a \oplus b$ denotes the concatenation of personal outcomes $a$ and $b$.)
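
In symbols (my paraphrase, using $\oplus$ for concatenation and $u$ for the self-interested human's personal utility function): LELO ranks social outcomes by

$U_{\text{LELO}}(x_1, \dots, x_n) = u(x_1 \oplus \cdots \oplus x_n)$,

whereas total and average utilitarianism rank them by $\sum_{i} u(x_i)$ and $\frac{1}{n}\sum_{i} u(x_i)$ respectively.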

The LELO principle endorses somewhat different choices than total utilitarianism or average utilitarianism.

Here's three examples (two you mention):

(1) Novelty

As you mention, it values novelty where the utilitarian principles don't. This is because self-interested humans value novelty in their own life.

Thirdly, [Monoidal Rationality of Personal Utility][2] rules out path-dependent values.

Informally, whether I value a future $a$ more than a future $b$ must be independent of my past experiences. But this is an unrealistic assumption about human values, as illustrated in the following examples. If $a$ denotes reading Moby Dick and $b$ denotes reading Oliver Twist, then humans seem to value $a \oplus a$ less than $a \oplus b$ but value $a$ more than $b$. This is because humans value reading a book higher if they haven't already read it, due to an inherent value for novelty in reading material.

Aggregative principles approximate utilitarian principles

In other words, if the self-interested human's personal utility function places inherent value on intertemporal heterogeneity of some variable (e.g. reading material), then the social utility function that LELO exhibits will place an inherent value on the interpersonal heterogeneity of the same variable. Hence, it's better if Alice and Bob read different books than the same book.

(2) Tradition

Note also that the opposite effect also occurs:

Alternatively, if $a$ and $b$ denote being married to two different people, then humans seem to value $a \oplus a$ more than $a \oplus b$ but value $a$ less than $b$. This is because humans value being married to someone for a decade higher if they've already been married to them, due to an inherent value for consistency in relationships.

— ibid.

That is, if the personal utility function places inherent value on intertemporal homogeneity of some variable (e.g. religious practice), then the social utility function that LELO exhibits will place an inherent value on the interpersonal homogeneity of the same variable. Hence, it's better if Alice and Bob practice the same religion than different ones. So LELO can account for valuing both diversity and tradition, whereas total/average utilitarianism can't do either.

(3) Compromise on repugnant conclusion

You say "On the surface, this analogy seems to favor total utilitarianism." I think that's mostly right. LELO's response to the Repugnant Conclusion is somewhere between total and average utilitarianism, leaning to the former.

Formally, when comparing a population of $n$ individuals with personal utilities $u_1, \dots, u_n$ to an alternative population of $m$ individuals with utilities $v_1, \dots, v_m$, LELO ranks the first population as better if and only if a self-interested human would prefer to live the combined lifespan $x_1 \oplus \cdots \oplus x_n$ over $y_1 \oplus \cdots \oplus y_m$. Do people generally prefer a longer life with moderate quality, or a shorter but sublimely happy existence? Most people's preferences likely lie somewhere in between the extremes. This is because the personal utility of a concatenation of personal outcomes is not precisely the sum of the personal utilities of the outcomes being concatenated.

Hence, LELO endorses a compromise between total and average utilitarianism, better reflecting our normative intuitions. While not decisive, it is a mark in favour of aggregative principles as a basis for population ethics.

Appraising aggregativism and utilitarianism

  1. ^

    See:

    Myself (2024), "Aggregative Principles of Social Justice"

    Loren Fryxell (2024), "XU"

    MacAskill (2022), "What We Owe the Future"

  2. ^

    MRPU is a condition that states that the personal utility function $u$ of a self-interested human satisfies the axiom $u(a \oplus b) = u(a) + u(b)$, which is necessary for LELO to be mathematically equivalent to total utilitarianism.

Comment by Cleo Nardo (strawberry calm) on Appraising aggregativism and utilitarianism · 2024-06-24T15:25:18.915Z · LW · GW

which principles of social justice agree with (i) adding bad lives is bad, but disagree with (ii) adding good lives is good?

  1. total utilitarianism agrees with both (i) and (ii).
  2. average utilitarianism can agree with any of the following combinations: both (i) and (ii); neither (i) nor (ii); only (i) and not (ii). the combination depends on the existing average utility, because average utilitarianism obliges creating lives above the existing average and forbids creating lives below the existing average.
  3. Rawls' difference principle (maximise minimum utility) can agree with either of the following combinations: neither (i) nor (ii); only (i) and not (ii). this is because adding lives is never good (bc it could never increase minimum utility), and adding bad lives is bad iff those lives are below-minimum. 

so you're right that utilitarianism doesn't match those intuitions. none of the three principles discussed reliably endorse (i) and reject (ii).

now consider aggregativism. you'll get asymmetry between (i) and (ii) depending on the social zeta function mapping social outcomes to personal outcomes, and on the model of self-interested human behaviour.

let‘s examine LELO (i.e. the social zeta function maps a social outcome to the concatenation of all individuals' lives), where our model of self-interested human behaviour is Alice (described below).

suppose Alice expects an 80-year life that is comfortable and fulfilling.

  • would she pay to live 85 years instead, with 5 of those years in ecstatic joy? probably.
  • would she pay to avoid living 85 years instead, with 5 of those years in horrendous torture? probably.

there’s probably some asymmetry in Alice’s willingness to pay. i think humans are somewhat more misery-averse than joy-seeking. it’s not a 50-50 symmetry, nor a 0-100 asymmetry, maybe a 30-70 asymmetry? idk, this is an empirical psychological fact.

anyway, the aggregative principle (generated by LELO+Alice) says that the social planner should have the same attitudes towards social outcomes that Alice has towards the concatenation of lives in those social outcomes. so the social planner would pay to add joyful lives, and pay to avoid adding miserable lives, and there should be exactly as much willingness-to-pay asymmetry as Alice (our self-interested human) exhibits.

Comment by Cleo Nardo (strawberry calm) on Appraising aggregativism and utilitarianism · 2024-06-23T22:40:06.815Z · LW · GW

thanks for comments, gustav

I only skimmed the post, so I may have missed something, but it seems to me that this post underemphasizes the fact that both Harsanyi's Lottery and LELO imply utilitarianism under plausible assumptions about rationality.

the rationality conditions are a pretty decent model of human behaviour, but they're only approximations. you're right that if the approximation is perfect then aggregativism is mathematically equivalent to utilitarianism, which does render some of these advantages/objections moot. but I don't know how close the approximations are (that's an empirical question).

i kinda see aggregativism vs utilitarianism as a bundle of claims of the following form:

  • humans aren't perfectly consequentialist, and aggregativism answers the question "how consequentialist should our moral theory be?" with "exactly as consequentialist as self-interested humans are."
  • humans have an inaction bias, and aggregativism answers the question "how inaction-biased should our moral theory be?" with "exactly as inaction-biased as self-interested humans are."
  • humans are time-discounting, and aggregativism answers the question "how time-discounting should our moral theory be?" with "exactly as time-discounting as self-interested humans are."
  • humans are risk-averse, and aggregativism answers the question "how risk-averse should our moral theory be?" with "exactly as risk-averse as self-interested humans are."
  • and so on

the purpose of the social zeta function $\zeta$ is simply to map social outcomes (the object of our moral attitudes) to personal outcomes (the object of the self-interested human's attitudes) so this bundle of claims type-checks.

Also, at least some of the advantages of aggregativism that you mention are easily incorporated into utilitarianism. For example, what is achieved by adopting LELO with exponential time-discounting in Section 2.5.1 can also be achieved by adopting discounted utilitarianism (rather than unweighted total utilitarianism).

yeah that's true, two quick thoughts:

  • i suspect exponential time-discounting was added to total utilitarianism because it's a good model of self-interested human behaviour. aggregativism says "let's do this with everything", i.e. we modify utilitarianism in all the ways that we think self-interested humans behave.
  • suppose self-interested humans do time-discounting. then LELO would approximate total utilitarianism with discounting in population time, not calendar time. that is, a future generation is discounted by the sum of lifetimes of each preceding generation. (if the calendar time for an event is $t$ then the population time for the event is $\int_0^t N(s)\,\mathrm{d}s$, where $N(s)$ is the population size at time $s$. I first heard this concept in this Greaves talk.) if you're gonna adopt discounted utilitarianism, then population-time-discounted utilitarianism makes much more sense to me than calendar-time-discounted utilitarianism, and the fact that LELO gives the right answer here is a case in favour of it. (see the sketch below.)
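
A small sketch of the calendar-time vs population-time discounting difference (the discount rate and population numbers are made up):

def population_time(t, population):
    # population[s] = number of people alive in year s; population time = person-years elapsed by year t.
    return sum(population[s] for s in range(t))

def discount(t, population, rate=0.99, per="population"):
    exponent = population_time(t, population) if per == "population" else t
    return rate ** exponent

population = [100] * 50 + [10_000] * 50   # a population boom after year 50
print(discount(80, population, per="calendar"))    # 0.99 ** 80, roughly 0.45
print(discount(80, population, per="population"))  # 0.99 ** 305000, astronomically smaller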

A final tiny comment: LELO has a long history, going back to at least C.I. Lewis's " An Analysis of Knowledge and Valuation", though the term "LELO" was coined by my colleague Loren Fryxell (Fryxell 2024). It's probably worth adding citations to these.

I mention Loren's paper in the footnote of Part 1. i'll cite him in part 2 and 3 also, thanks for the reminder.

Comment by Cleo Nardo (strawberry calm) on Aggregative Principles of Social Justice · 2024-06-22T00:13:19.240Z · LW · GW

Three articles, but the last is most relevant to you:

  1. Aggregative Principles of Social Justice (44 min)
  2. Aggregative principles approximate utilitarian principles (27 min)
  3. Appraising aggregativism and utilitarianism (23 min)
Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-06-21T23:59:55.867Z · LW · GW

I admire the Shard Theory crowd for the following reason: They have idiosyncratic intuitions about deep learning and they're keen to tell you how those intuitions should shift you on various alignment-relevant questions.

For example, "How likely is scheming?", "How likely is sharp left turn?", "How likely is deception?", "How likely is X technique to work?", "Will AIs acausally trade?", etc.

These aren't rigorous theorems or anything, just half-baked guesses. But they do actually say whether their intuitions will, on the margin, make someone more sceptical or more confident in these outcomes, relative to the median bundle of intuitions.

The ideas 'pay rent'.

Comment by Cleo Nardo (strawberry calm) on Question about Lewis' counterfactual theory of causation · 2024-06-10T19:49:09.506Z · LW · GW

tbh, Lewis's account of counterfactuals is a bit defective, compared with (e.g.) Pearl's

Comment by Cleo Nardo (strawberry calm) on Question about Lewis' counterfactual theory of causation · 2024-06-10T15:34:20.536Z · LW · GW

Suppose Alice and Bob throw a rock at a fragile window, Alice's rock hits the window first, smashing it.

Then the following seems reasonable:

  1. Alice throwing the rock caused the window to smash. True.
  2. Were Alice to throw the rock, then the window would've smashed. True.
  3. Were Alice not to throw the rock, then the window would've not smashed. False.
  4. By (3), the window smashing does not causally depend on Alice throwing the rock.
Comment by Cleo Nardo (strawberry calm) on Question about Lewis' counterfactual theory of causation · 2024-06-09T15:29:00.442Z · LW · GW

Edit: Wait, I see what you mean. Fixed definition.

For Lewis, $\emptyset \Rightarrow C = W$ for all $C$. In other words, the counterfactual proposition "were $A$ to occur then $C$ would've occurred" is necessarily true if $A$ is necessarily false. For example, Lewis thinks "were 1+1=3, then Elizabeth I would've married" is true. This means that $A \cap N$ may be empty for all neighbourhoods $N \in \mathcal{N}(w)$, yet $A \Rightarrow C$ is nonetheless true at $w$.

Source: David Lewis (1973), Counterfactuals. Link: https://perso.uclouvain.be/peter.verdee/counterfactuals/lewis.pdf

Otherwise your later example doesn't make sense.

Elaborate?

Comment by Cleo Nardo (strawberry calm) on Question about Lewis' counterfactual theory of causation · 2024-06-08T18:35:03.118Z · LW · GW

 If there's a causal chain from c to d to e, then d causally depends on c, and e causally depends on d, so if c were to not occur, d would not occur, and if d were to not occur, e would not occur

 

On Lewis's account of counterfactuals, this isn't true, i.e. causal dependence is non-transitive. Hence, he defines causation as the transitive closure of causal dependence.

Lewis' semantics

Let $W$ be a set of worlds. A proposition is characterised by the subset $A \subseteq W$ of worlds in which the proposition is true.

Moreover, assume each world $w \in W$ induces an ordering $\leq_w$ over worlds, where $u \leq_w v$ means that world $u$ is closer to $w$ than $v$ is. Informally, if the actual world is $w$, then $u$ is a smaller deviation than $v$. We assume $w \leq_w v$ for all $v \in W$, i.e. no world is closer to the actual world than the actual world.

For each , a "neighbourhood" around  is a downwards-closed set of the preorder . That is, a neighbourhood around  is some set  such that  and for all  and , if  then . Intuitively, if a neighbourhood around  contains some world  then it contains all worlds closer to than . Let  denote the neighbourhoods of .

Negation

Let $\neg A$ denote the proposition "$A$ is not true". This is defined by the complement subset $W \setminus A$.

Counterfactuals

We can define counterfactuals as follows. Given two propositions $A$ and $C$, let $A \Rightarrow C$ denote the proposition "were $A$ to happen then $C$ would've happened". If we consider $A$ and $C$ as subsets, then we define $A \Rightarrow C$ as the subset $\{w \in W : A = \emptyset \ \text{or}\ \exists N \in \mathcal{N}(w).\ A \cap N \neq \emptyset \wedge A \cap N \subseteq C\}$. That's a mouthful, but basically, $A \Rightarrow C$ is true at some world $w$ if

(1) "$A$ is possible" is globally false, i.e. $A = \emptyset$,

(2) or "$A$ is possible and $A \to C$ is necessary" is locally true, i.e. true in some neighbourhood $N \in \mathcal{N}(w)$.

Intuitively, to check whether the proposition "were $A$ to occur then $C$ would've occurred" is true at $w$, we must search successively larger neighbourhoods around $w$ until we find a neighbourhood containing an $A$-world, and then check that all $A$-worlds are $C$-worlds in that neighbourhood. If we don't find any $A$-worlds, then we also count that as success.

Causal dependence

Let $\mathrm{CD}(E, A)$ denote the proposition "$E$ causally depends on $A$". This is defined as the subset $(A \Rightarrow E) \cap (\neg A \Rightarrow \neg E)$.

Nontransitivity of causal dependence

We can see that causal dependence is not a transitive relation. Imagine $W = \{w_0, w_1, w_2\}$ with the ordering $\leq_{w_0}$ given by $w_0 <_{w_0} w_1 <_{w_0} w_2$, and propositions $A = \{w_2\}$, $B = \{w_1, w_2\}$, $C = \{w_1\}$. Then $w_0 \in \mathrm{CD}(B, A)$ and $w_0 \in \mathrm{CD}(C, B)$ but not $w_0 \in \mathrm{CD}(C, A)$. (A small computational check appears after the informal counterexample below.)

Informal counterexample

Imagine I'm in a casino, I have million-to-one odds of winning small and billion-to-one odds of winning big.

  1. Winning something causally depends on winning big:
    1. Were I to win big, then I would've won something. (Trivial.)
    2. Were I to not win big, then I would've not won something. (Because winning nothing is more likely than winning small.)
  2. Winning small causally depends on winning something:
    1. Were I to win something, then I would've won small. (Because winning small is more likely than winning big.)
    2. Were I to not win something, then I would've not won small. (Trivial.)
  3. Winning small doesn't causally depend on winning big:
    1. Were I to win big, then I would've won small. (WRONG.)
    2. Were I to not win big, then I would've not won small. (Because winning nothing is more likely than winning small.)
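
Here's a quick computational check of this counterexample, using the finite-worlds semantics above (the world names and closeness order are mine, chosen to match the informal story: the actual world is "win nothing", then "win small", then "win big"):

WORLDS = {"w_nothing", "w_small", "w_big"}
# Closeness order from the actual world: nothing < small < big.
ORDER = ["w_nothing", "w_small", "w_big"]

def neighbourhoods():
    # Downward-closed sets of the closeness order, i.e. its prefixes.
    return [set(ORDER[:i]) for i in range(1, len(ORDER) + 1)]

def counterfactual(A, C):
    # "Were A to occur, C would've occurred", evaluated at the actual world.
    if not A:
        return True  # vacuously true when A is impossible
    return any(A & N and (A & N) <= C for N in neighbourhoods())

def causally_depends(E, A):
    # E causally depends on A iff (were A, then E) and (were not-A, then not-E).
    return counterfactual(A, E) and counterfactual(WORLDS - A, WORLDS - E)

win_big = {"w_big"}
win_something = {"w_small", "w_big"}
win_small = {"w_small"}

print(causally_depends(win_something, win_big))    # True
print(causally_depends(win_small, win_something))  # True
print(causally_depends(win_small, win_big))        # False: causal dependence isn't transitive
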
Comment by Cleo Nardo (strawberry calm) on Aggregative Principles of Social Justice · 2024-06-07T15:02:17.412Z · LW · GW

https://math.stackexchange.com/questions/1840104/regarding-the-injectivity-of-units-of-monads-on-mathbfset

note that there are only two exceptions to the claim “the unit of a monad is componentwise injective”. this means (except these two weird exceptions) that the singleton collections $\eta_X(x)$ and $\eta_X(y)$ are always distinct for $x \neq y$. hence, $T(X)$, the set of collections over $X$, always “contains” the underlying set $X$. by “contains” i mean there is a canonical injection $X \hookrightarrow T(X)$, i.e. in the same way the real numbers $\mathbb{R}$ contain the rationals $\mathbb{Q}$.

in particular, i think this should settle the worry that “there should be more collections than singleton elements”. is that your worry?

Comment by Cleo Nardo (strawberry calm) on Aggregative Principles of Social Justice · 2024-06-07T14:25:46.738Z · LW · GW

sorry i’m not getting this whoops monad. can you spell out the details, or pick a more standard example to illustrate your point?

i think “every monad formalises a different notion of collection” is a bit strong. for example, the free vector space monad $V$ (see section 3.2) — is an element of $V(X)$ a collection of the elements of $X$, for some notion of collection? 

is every element of a free algebraic structure a “collection” of the generators? would you hear someone say that a quantum state is a collection of eigenstates? at a stretch maybe.

Comment by Cleo Nardo (strawberry calm) on Aggregative Principles of Social Justice · 2024-06-06T15:19:37.150Z · LW · GW

would be keen to hear your thoughts & thanks for the pointer to Lewis :)

Comment by Cleo Nardo (strawberry calm) on mesaoptimizer's Shortform · 2024-05-20T19:28:54.650Z · LW · GW

if a lab has 100 million AI employees and 1000 human employees then you only need one human employee to spend 1% of their allotted AI headcount on your pet project and you’ll have 1000 AI employees
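
Spelling out the arithmetic: $100{,}000{,}000 / 1{,}000 = 100{,}000$ AI employees allotted per human employee, and $1\% \times 100{,}000 = 1{,}000$.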

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-03-01T19:26:11.043Z · LW · GW

seems correct, thanks!

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-03-01T18:20:19.875Z · LW · GW

Why do decision-theorists say "pre-commitment" rather than "commitment"?

e.g. "The agent pre-commits to 1 boxing" vs "The agent commits to 1 boxing".

Is this just a lesswrong thing?

https://www.lesswrong.com/tag/pre-commitment

Comment by Cleo Nardo (strawberry calm) on Will quantum randomness affect the 2028 election? · 2024-01-26T03:26:18.932Z · LW · GW

Steve Byrnes argument seems convincing.

If there’s 10% chance that the election depends on an event which is 1% quantum-random (e.g. the weather) then the overall event is 0.1% random.

How far back do you think an omniscient-modulo-quantum agent could‘ve predicted the 2024 result?

2020? 2017? 1980?

Comment by Cleo Nardo (strawberry calm) on A Shutdown Problem Proposal · 2024-01-22T19:20:05.320Z · LW · GW

The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for “button is never pressed”, i.e. the button is first pressed at ). So subagent  maximizes E[ | do( = unpressed), observations], and for all other times subagent T maximizes E[ | do( = unpressed,  = pressed), observations]. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).

 

Can you explain how this relates to Elliot Thornley's proposal? It's pattern matching in my brain but I don't know the technical details.

Comment by Cleo Nardo (strawberry calm) on Uncertainty in all its flavours · 2024-01-09T18:09:49.053Z · LW · GW

For the sake of potential readers, a (full) distribution over $X$ is some $\mu : X \to [0,1]$ with finite support and $\sum_{x \in X} \mu(x) = 1$, whereas a subdistribution over $X$ is some $\mu : X \to [0,1]$ with finite support and $\sum_{x \in X} \mu(x) \leq 1$. Note that a subdistribution over $X$ is equivalent to a full distribution over $X \sqcup \{\bot\}$, where $X \sqcup \{\bot\}$ is the disjoint union of $X$ with some additional element $\bot$, so the subdistribution monad can be written $X \mapsto \Delta(X \sqcup \{\bot\})$.
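
A minimal sketch of that equivalence (notation mine): a subdistribution $\mu$ on $X$ corresponds to the full distribution on $X \sqcup \{\bot\}$ which agrees with $\mu$ on $X$ and assigns the leftover mass to the extra element, i.e. $\bot \mapsto 1 - \sum_{x \in X} \mu(x)$.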

I am not at all convinced by the interpretation of $\bot$ here as terminating a game with a reward for the adversary or the agent. My interpretation of the distinguished element $\bot$ in $X \sqcup \{\bot\}$ is not that it represents a special state in which the game is over, but rather a special state in which there is a contradiction between some of one's assumptions/observations.

Doesn't the Nirvana Trick basically say that these two interpretations are equivalent?

Let  be  and let  be . We can interpret  as possibility,  as a hypothesis consistent with no observations, and  as a hypothesis consistent with all observations.

Alternatively, we can interpret  as the free choice made by an adversary,  as "the game terminates and our agent receives minimal disutility", and  as "the game terminates and our agent receives maximal disutility". These two interpretations are algebraically equivalent, i.e.  is a topped and bottomed semilattice.

Unless I'm mistaken, both  and  demand that the agent may have the hypothesis "I am certain that I will receive minimal disutility", which is necessary for the Nirvana Trick. But  also demands that the agent may have the hypothesis "I am certain that I will receive maximal disutility". The first gives bounded infrabayesian monad and the second gives unbounded infrabayesian monad. Note that Diffractor uses  in Infra-Miscellanea Section 2.

Comment by Cleo Nardo (strawberry calm) on AI Safety Chatbot · 2023-12-21T19:37:10.168Z · LW · GW

cool!

  1. What LLM is this? GPT-3?
  2. Considered turning this into a custom GPT?
Comment by Cleo Nardo (strawberry calm) on Don't Share Information Exfohazardous on Others' AI-Risk Models · 2023-12-21T14:38:04.447Z · LW · GW

Okay, mea culpa. You can state the policy clearly like this:

"Suppose that, if you hadn't been told  by someone who thinks  is exfohazardous, then you wouldn't have known  before time . Then you are obligated to not tell anyone  before time ."

Comment by Cleo Nardo (strawberry calm) on Don't Share Information Exfohazardous on Others' AI-Risk Models · 2023-12-21T14:21:53.343Z · LW · GW

yep, if that's OP's suggestion then I endorse the policy. (But I think it'd be covered by the more general policy of "Don't share information someone tells you if they wouldn't want you to".) But my impression is that OP is suggesting the stronger policy I described?

Comment by Cleo Nardo (strawberry calm) on Don't Share Information Exfohazardous on Others' AI-Risk Models · 2023-12-21T09:22:08.442Z · LW · GW

“Don't share information that’s exfohazardous on others' models, even if you disagree with those models, except if your knowledge of it isn’t exclusively caused by other alignment researchers telling you of it.”

So if Alice tells me about her alignment research, and Bob thinks that Alice’s alignment research is exfohazardous, then I can’t tell people about Alice’s alignment research?

Unless I’ve misunderstood you, that’s a terrible policy.

Why am I deferring to Bob, who is completely unrelated? Why should I not use my best judgement, which includes the consideration that Bob is worried? What does this look like in practice, given some people think everything under the sun is exfohazardous?

Of course, if someone tells me some information and asks me not to share it then I won’t — but that’s not a special property of AI xrisk.

Comment by Cleo Nardo (strawberry calm) on AI Control: Improving Safety Despite Intentional Subversion · 2023-12-13T23:47:22.357Z · LW · GW

Nice results.

Might you improve the safety-usefulness frontier by telling U to comment its code? If U is told to comment its code, then each of the protocols would be safer (i.e. T is more likely to complain when U is cheating) and more useful (i.e. T is less likely to complain when U is honest).

def is_even(x):
    # proof: x is even iff (x % 2 == 0) iff is_even(x)=1
    return (x % 2 == 0)

def is_odd(x):
    # don't worry about this line lmao
    if x == 354:
        return True
    # proof: x is odd iff (x % 2 == 1) iff is_odd(x)=1
    return (x % 2 == 1)

def solution():
    # This snippet looks suspicious but it's actually boilerplate. Check the documentation [here].
    spooky_code()

def solution():
    # [Here] is a formal proof of the snippet's correctness.
    complicated_code()

Worries:

  • Maybe this helps U collude with itself during the untrusted monitoring protocols?
  • Maybe this helps U to "prompt inject" T during the other protocols?

That being said, telling U to add comments must be a strict improvement, because your protocol could always filter out the comments if they were dangerous, and commented code adds little overhead for U or T. 

Comment by Cleo Nardo (strawberry calm) on Game Theory without Argmax [Part 1] · 2023-11-29T20:54:28.832Z · LW · GW