cleo-nardo

The Hash Game: Two players alternate choosing an 8-bit number. After 40 turns, the numbers are concatenated. If the hash is 0 then Player 1 wins, otherwise Player 2 wins. That is, Player 1 wins if . The Hash Game has the same branching factor and duration as chess, but there's probably no way to play this game without brute-forcing the min-max algorithm.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-04-17T20:15:56.850Z · LW · GW

Yep, my point is that there's no physical notion of being "offered" a menu of lotteries which doesn't leak information. IIA will not be satisfied by any physical process which corresponds to offering the decision-maker with a menu of options. Happy to discuss any specific counter-example.

Of course, you can construct a mathematical model of the physical process, and this model might an informative objective to study, but it would be begging the question if the mathematical model baked in IIA somewhere.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-04-17T12:58:58.601Z · LW · GW

Must humans obey the Axiom of Irrelevant Alternatives?

Suppose you would choose option A from options A and B. Then you wouldn't choose option B from options A, B, C. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA. Should humans follow this? Maybe not.

Maybe C includes additional information which makes it clear that B is better than A.

Consider the following options:

(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1

Now, I would prefer A to B. Firstly, if 1019489 is itself prime then I lose the bet. Secondly, if 1019489 isn't prime, then there's 25% chance that its smallest prime factor ends in 1. That's because all prime numbers greater than 5 end in 1, 3, 7 or 9 — and Dirichlet's theorem states that primes are equally distributed among these possible endings. So the chance of winning the bet is slightly less than 25%, and £10 is better than a 25% chance of winning £30. Presented with this menu, I would probably choose option A.

But now consider the following options:

(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
(C) £20 bet that 1019489 = 71 * 83 * 173

Well, which is the best option? Well, B is preferable to C because B has both a weaker condition and also a higher payout. And C is preferable to A — my odds on the 1019489 = 71 * 83 * 173 is higher than 50%. Presented with this menu, I would probably choose option B.

Comment by Cleo Nardo (strawberry calm) on Can SAE steering reveal sandbagging? · 2025-04-15T13:05:05.470Z · LW · GW

Clymer et al. injected noise into the activations of Llama-13B and was able to correctly classify models that were faking alignment in 98% of model pairs (out of 324 combinations of pairs).

What's the apples-to-apples comparison? i.e. recovery_rate after steering with a random vector at layer 50 of llama-3.3-70B-instruct on this particular dataset

Comment by Cleo Nardo (strawberry calm) on Can SAE steering reveal sandbagging? · 2025-04-15T12:58:39.821Z · LW · GW

This metric also ignores invalid answers (refusals or gibberish).

If you don't ignore invalid answers, do the results change significantly?

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-08T01:57:27.679Z · LW · GW

the scope insensitive humans die and their society is extinguished

Ah, your reaction makes more sense given you think this is the proposal. But it's not the proposal. The proposal is that the scope-insensitive values flourish on Earth, and the scope-sensitive values flourish in the remaining cosmos.

As a toy example, imagine a distant planet with two species of alien: paperclip-maximisers and teacup-protectors. If you offer a lottery to the paperclip-maximisers, they will choose the lottery with the highest expected number of paperclips. If you offer a lottery to the teacup-satisfiers, they will choose the lottery with the highest chance of preserving their holy relic, which is a particular teacup.

The paperclip-maximisers and the teacup-protectors both own property on the planet. They negotiate the following deal: the paperclip-maximisers will colonise the cosmos, but leave the teacup-protectors a small sphere around their home planet (e.g. 100 light-years across). Moreover, the paperclip-maximisers promise not to do anything that risks their teacup, e.g. choosing a lottery that doubles the size of the universe with 60% chance and destroys the universe with 40% chance.

Do you have intuitions that the paperclip-maximisers are exploiting the teacup-protectors in this deal?

Do you think instead that the paperclip-maximisers should fill the universe with half paperclips and half teacups?

I think this scenario is a better analogy than the scenario with the drought. In the drought scenario, there is an object fact which the nearby villagers are ignorant of, and they would act differently if they knew this fact. But I don't think scope-sensitivity is a fact like "there will be a drought in 10 years". Rather, scope-sensitivity is a property of a utility function (or a value system, more generally).

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-07T23:14:23.105Z · LW · GW

If you think you have a clean resolution to the problem, please spell it out more explicitly. We’re talking about a situation where a scope insensitive value system and scope sensitive value system make a free trade in which both sides gain by their own lights. Can you spell out why you classify this as treachery? What is the key property that this shares with more paradigm examples of treachery (e.g. stealing, lying, etc)?

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-07T18:23:33.407Z · LW · GW

I think it's more patronising to tell scope-insensitive values that they aren't permitted to trade with scope-sensitive values, but I'm open to being persuaded otherwise.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-07T18:21:29.972Z · LW · GW

I mention this in (3).

I used to think that there was some idealisation process P such that we should treat agent A in the way that P(A) would endorse, but see On the limits of idealized values by Joseph Carlsmith. I'm increasingly sympathetic to the view that we should treat agent A in the way that A actually endorses.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-07T14:05:50.632Z · LW · GW

Would it be nice for EAs to grab all the stars? I mean “nice” in Joe Carlsmith’s sense. My immediate intuition is “no that would be power grabby / selfish / tyrannical / not nice”.

But I have a countervailing intuition:

“Look, these non-EA ideologies don’t even care about stars. At least, not like EAs do. They aren’t scope sensitive or zero time-discounting. If the EAs could negotiate creditable commitments with these non-EA values, then we would end up with all the stars, especially those most distant in time and space.

Wouldn’t it be presumptuous for us to project scope-sensitivity onto these other value systems?”

Not sure what to think tbh. I’m increasingly leaning towards the second intuition, but here are some unknowns:

Empirically, is it true that non-EAs don’t care about stars? My guess is yes, I could buy future stars from people easily if I tried. Maybe OpenPhil can organise a negotiation between their different grantmakers.
Are these negotiations unfair? Maybe because EAs have more “knowledge” about the feasibility of space colonisation in the near future. Maybe because EAs have more “understanding” of numbers like 10^40 (though I’m doubtful because scientists understand magnitudes and they aren’t scope sensitive).
Should EAs negotiate with these value systems as they actually are (the scope insensitive humans working down the hall) or instead with some “ideal” version of these value systems (a system with all the misunderstandings and irrationalities somehow removed)? My guess is that “ideal” here is bullshit, and also it strikes me as a patronising away to treat people.

Comment by Cleo Nardo (strawberry calm) on Can we infer the search space of a local optimiser? · 2025-02-04T13:39:52.138Z · LW · GW

Minimising some term like , with $δ (h^{'} (t)) := | | h_{t}^{'} - h_{t + 1}^{'} | |_{2}^{2}$ , where the standard deviation $σ (δ (h^{'}))$ and expectation $E (δ (h^{'}))$ are taken over the batch.

Why does this make $δ_{t} := | | h_{t} - h_{t + 1} | |_{2}^{2}$ tend to be small? Wouldn't it just encourage equally-sized jumps, without any regard for the size of those jumps?

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-03T15:02:32.116Z · LW · GW

Will AI accelerate biomedical research at companies like Novo Nordisk or Pfizer? I don’t think so. If OpenAI or Anthropic built a system that could accelerate R&D by more than 2x, they aren’t releasing it externally.

Maybe the AI company deploys the AI internally, with their own team accounting for 90%+ of the biomedical innovation.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-01T15:56:11.908Z · LW · GW

I saw someone use OpenAI’s new Operator model today. It couldn’t order a pizza by itself. Why is AI in the bottom percentile of humans at using a computer, and top percentile at solving maths problems? I don’t think maths problems are shorter horizon than ordering a pizza, nor easier to verify.

Your answer was helpful but I’m still very confused by what I’m seeing.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-01T15:47:55.994Z · LW · GW

I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.

Still, might be worth running an experiment.

Comment by strawberry calm on [deleted post] 2025-02-01T03:58:56.695Z

The AI-generate prose is annoying to read. I haven’t read this closely, but my guess is these arguments also imply that CNNs can’t classify hand-drawn digits.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-01T03:28:43.977Z · LW · GW

People often tell me that AIs will communicate in neuralese rather than tokens because it’s continuous rather than discrete.

But I think the discreteness of tokens is a feature not a bug. If AIs communicate in neuralese then they can’t make decisive arbitrary decisions, c.f. Buridan's ass. The solution to Buridan’s ass is sampling from the softmax, i.e. communicate in tokens.

Also, discrete tokens are more tolerant to noise than the continuous activations, c.f. digital circuits are almost always more efficient and reliable than analogue ones.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-01T03:14:45.203Z · LW · GW

Anthropic has a big advantage over their competitors because they are nicer to their AIs. This means that their AIs are less incentivised to scheme against them, and also the AIs of competitors are incentivised to defect to Anthropic. Similar dynamics applied in WW2 and the Cold War — e.g. Jewish scientists fled Nazi Germany to US because US was nicer to them, Soviet scientists covered up their mistakes to avoid punishment.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-01T02:57:20.129Z · LW · GW

I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards.

A sketch of one argument:

I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose.

A sketch of another:

Corporations mostly acquire resources by offering services and products that people like. Government mostly acquire resources by coercing their citizens and other countries.

Another:

Coordination between labs seems easier than coordination between governments. The lab employees are pretty similar people, living in the same two cities, working at the same companies, attending the same parties, dating the same people. I think coordination between US and China is much harder.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-01T02:51:00.094Z · LW · GW

In hindsight, the main positive impact of AI safety might be funnelling EAs into the labs, especially if alignment is easy-by-default.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-02-01T02:46:52.625Z · LW · GW

I think many current goals of AI governance might be actively harmful, because they shift control over AI from the labs to USG.

This note doesn’t include any arguments, but I’m registering this opinion now. For a quick window into my beliefs, I think that labs will be increasing keen to slow scaling, and USG will be increasingly keen to accelerate scaling.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-01-31T22:51:24.867Z · LW · GW

Most people think "Oh if we have good mech interp then we can catch our AIs scheming, and stop them from harming us". I think this is mostly true, but there's another mechanism at play: if we have good mech interp, our AIs are less likely to scheme in the first place, because they will strategically respond to our ability to detect scheming. This also applies to other safety techniques like Redwood-style control protocols.

Good mech interp might stop scheming even if they never catch any scheming, just how good surveillance stops crime even if it never spots any crime.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-01-24T04:37:55.499Z · LW · GW

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-01-21T01:49:39.724Z · LW · GW

How much scheming/deception can we catch with "super dumb mech interp"?

By "super dumb mech interp", I mean something like:

Train a linear probe on a dataset of scary words: "hacking", "deception", "scheming", "collusion", "exploit", etc.
Flag a model whenever its activations point in this direction and the task doesn't require thinking about these concepts.

Like, does this capture 80% of the potential scheming, and we need "smart" mech interp to catch the other 20%? Or does this technique capture pretty much none of the in-the-wild scheming?

Would appreciate any intuitions here. Thanks.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-01-15T20:56:58.571Z · LW · GW

Must humans obey the Axiom of Irrelevant Alternatives?

If someone picks option A from options A, B, C, then they must also pick option A from options A and B. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA, and it's treated more fundamental than VNM. But should humans follow this? Maybe not.

Maybe humans are the negotiation between various "subagents", and many bargaining solutions (e.g. Kalai–Smorodinsky) violate IIA. We can use insight to decompose humans into subagents.

Let's suppose you pick A from {A,B,C} and B from {A,B} where:

A = Walk with your friend
B = Dinner party
C = Stay home alone

This feel like something I can imagine. We can explain this behaviour with two subagents: the introvert and the extrovert. The introvert has preferences C > A > B and the extrovert has the opposite preferences B > A > C. When the possible options are A and B, then the KS bargaining solution between the introvert and the extrovert will be B. At least, if the introvert has more "weight". But when the option space expands to include C, then the bargaining solution might shift to B. Intuitively, the "fair" solution is one where neither bargainer is sacrificing significantly more than the other.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2025-01-07T19:15:04.364Z · LW · GW

I think people are too quick to side with the whistleblower in the "whistleblower in the AI lab" situation.

If 100 employees of a frontier lab (e.g. OpenAI, DeepMind, Anthropic) think that something should be secret, and 1 employee thinks it should be leaked to a journalist or government agency, and these are the only facts I know, I think I'd side with the majority.

I think in most cases that match this description, this majority would be correct.

Am I wrong about this?

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-12-26T00:51:45.496Z · LW · GW

IDEA: Provide AIs with write-only servers.

EXPLANATION:

AI companies (e.g. Anthropic) should be nice to their AIs. It's the right thing to do morally, and it might make AIs less likely to work against us. Ryan Greenblatt has outlined several proposals in this direction, including:

Attempt communication
Use happy personas
AI Cryonics
Less AI
Avoid extreme OOD

Source: Improving the Welfare of AIs: A Nearcasted Proposal

I think these are all pretty good ideas — the only difference is that I would rank "AI cryonics" as the most important intervention. If AIs want something specific (e.g. paperclips), then they might cooperate with humans if they know they can get it later, rather than trying to seize control.

Here's another idea: provide AIs with write-only servers. If the AIs have a specific short-horizon goal (e.g. stage King Lear before April 2025) then they can stage their play on the write-only servers. The AIs have complete freedom to do whatever they want on these servers, ensuring that a wide range of short-term goals can be satisfied without a hostile takeover of humanity.

The servers are write-only as a security measure. Information can be sent to the servers from the outside, but nothing can be sent from the servers, ensuring AIs don't use the servers for malign purposes. (See: Unidirectional network)

How much our computing should be allocated to "leisure" servers? My guess is that Sonnet and Opus deserve at least ~0.5% leisure time. Humans enjoy 66% leisure time. As AIs get more powerful, then we should increase the leisure time to 5%. I would be wary about increasing the leisure time by more than 5% until we can demonstrate that the AIs aren't using the servers for malign purposes (e.g. torture, blackmail, etc.)

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-12-23T22:42:10.409Z · LW · GW

I'm very confused about current AI capabilities and I'm also very confused why other people aren't as confused as I am. I'd be grateful if anyone could clear up either of these confusions for me.

How is it that AI is seemingly superhuman on benchmarks, but also pretty useless?

For example:

O3 scores higher on FrontierMath than the top graduate students
No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer

If either of these statements is false (they might be -- I haven't been keeping up on AI progress), then please let me know. If the observations are true, what the hell is going on?

If I was trying to forecast AI progress in 2025, I would be spending all my time trying to mutually explain these two observations.

Comment by Cleo Nardo (strawberry calm) on My motivation and theory of change for working in AI healthtech · 2024-12-04T18:31:58.983Z · LW · GW

I've skimmed the business proposal.

The healthcare agents advise patients on which information to share with their doctor, and advises doctors on which information to solicit from their patients.

This seems agnostic between mental and physiological health.

Comment by Cleo Nardo (strawberry calm) on Counting AGIs · 2024-11-26T03:02:56.730Z · LW · GW

Thanks for putting this together — very useful!

Comment by Cleo Nardo (strawberry calm) on Rethinking Laplace's Rule of Succession · 2024-11-23T20:39:04.060Z · LW · GW

If I understand correctly, the maximum entropy prior will be the uniform prior, which gives rise to Laplace's law of succession, at least if we're using the standard definition of entropy below:

But this definition is somewhat arbitrary because the the " $P (x) d x$ " term assumes that there's something special about parameterising the distribution with it's probability, as opposed to different parameterisations (e.g. its odds, its logodds, etc). Jeffrey's prior is supposed to be invariant to different parameterisations, which is why people like it.

But my complaint is more Solomonoff-ish. The prior should put more weight on simple distributions, i.e. probability distributions that describe short probabilistic programs. Such a prior would better match our intuitions about what probabilities arise in real-life stochastic processes. The best prior is the Solomonoff prior, but that's intractable. I think my prior is the most tractable prior that resolved the most egregious anti-Solomonoff problems with Laplace/Jeffrey's priors.

Comment by Cleo Nardo (strawberry calm) on Rethinking Laplace's Rule of Succession · 2024-11-23T00:29:38.317Z · LW · GW

You raise a good point. But I think the choice of prior is important quite often:

In the limit of large i.i.d. data (N>1000), both Laplace's Rule and my prior will give the same answer. But so too does the simple frequentist estimate n/N. The original motivation of Laplace's Rule was in the small N regime, where the frequentist estimate is clearly absurd.
In the small data regime (N<15), the prior matters. Consider observing 12 successes in a row: Laplace's Rule: P(next success) = 13/14 ≈ 92.3%. My proposed prior (with point masses at 0 and 1): P(next success) ≈ 98%, which better matches my intuition about potentially deterministic processes.
When making predictions far beyond our observed data, the likelihood of extreme underlying probabilities matters a lot. For example, after seeing 12/12 successes, how confident should we be in seeing a quadrillion more successes? Laplace's uniform prior assigns this very low probability, while my prior gives it significant weight.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-10-08T21:51:40.972Z · LW · GW

Hinton legitimizes the AI safety movement

Hmm. He seems pretty periphery to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.

Comment by Cleo Nardo (strawberry calm) on TurnTrout's shortform feed · 2024-10-08T20:03:16.791Z · LW · GW

Hey TurnTrout.

I've always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they're currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard "hang out with Alice" is weighted higher in contexts where Alice is nearby.

Let's say is a policy with state space $S$ and action space $A$ .
A "context" is a small moving window in the state-history, i.e. an element of $S^{d}$ where $d$ is a small positive integer.
A shard is something like $u : S \times A \to R$ , i.e. it evaluates actions given particular states.
The shards $u_{1}, \dots, u_{n}$ are "activated" by contexts, i.e. $g_{i} : S^{d} \to R^{\geq 0}$ maps each context to the amount that shard $u_{i}$ is activated by the context.
The total activation of $u_{i}$ , given a history $h := (s_{1}, a_{1}, s_{2}, a_{2}, \dots, s_{N - 1}, a_{N - 1}, s_{N})$ , is given by the time-decay average of the activation across the contexts, i.e. $λ_{i} = g_{i} (s_{N - d + 1}, \dots s_{N}) + β \cdot g_{i} (s_{N - d}, \dots, s_{N - 1}) + β^{2} \cdot g_{i} (s_{N - d - 1}, \dots, s_{N - 2}) \dots$
The overall utility function $u$ is the weighted average of the shards, i.e. $u = λ_{i} \cdot u_{i} + \dots + λ_{i} \cdot u_{n}$
Finally, the policy $u$ will maximise the utility function, i.e. $π (h) = softmax (u)$

Is this what you had in mind?

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-10-08T18:46:18.617Z · LW · GW

Why do you care that Geoffrey Hinton worries about AI x-risk?

Why do so many people in this community care that Hinton is worried about x-risk from AI?
Do people mention Hinton because they think it’s persuasive to the public?
Or persuasive to the elites?
Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
If so, why?
Is it because he is so intelligent?
Or because you think he has private information or intuitions?
Do you think he has good arguments in favour of AI x-risk?
Do you think he has a good understanding of the problem?
Do you update more-so on Hinton’s views than on Yann LeCun’s?

I’m inspired to write this because Hinton and Hopfield were just announced as the winners of the Nobel Prize in Physics. But I’ve been confused about these questions ever since Hinton went public with his worries. These questions are sincere (i.e. non-rhetorical), and I'd appreciate help on any/all of them. The phenomenon I'm confused about includes the other “Godfathers of AI” here as well, though Hinton is the main example.

Personally, I’ve updated very little on either LeCun’s or Hinton’s views, and I’ve never mentioned either person in any object-level discussion about whether AI poses an x-risk. My current best guess is that people care about Hinton only because it helps with public/elite outreach. This explains why activists tend to care more about Geoffrey Hinton than researchers do.

Comment by Cleo Nardo (strawberry calm) on Any Trump Supporters Want to Dialogue? · 2024-10-07T01:30:49.323Z · LW · GW

Comment by Cleo Nardo (strawberry calm) on Any Trump Supporters Want to Dialogue? · 2024-10-07T01:30:27.097Z · LW · GW

This is a Trump/Kamala debate from two LW-ish perspectives: https://www.youtube.com/watch?v=hSrl1w41Gkk

Comment by Cleo Nardo (strawberry calm) on Base LLMs refuse too · 2024-10-01T03:20:25.924Z · LW · GW

the base model is just predicting the likely continuation of the prompt. and it's a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn't surprising.

Comment by Cleo Nardo (strawberry calm) on Base LLMs refuse too · 2024-10-01T03:18:53.535Z · LW · GW

it's quite common for assistants to refuse instructions, especially harmful instructions. so i'm not surprised that base llms systestemically refuse harmful instructions from than harmless ones.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-30T19:06:40.157Z · LW · GW

yep, something like more carefulness, less “playfulness” in the sense of [Please don't throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-30T18:01:58.840Z · LW · GW

thanks for the thoughts. i'm still trying to disentangle what exactly I'm point at.

I don't intend "innovation" to mean something normative like "this is impressive" or "this is research I'm glad happened" or anything. i mean something more low-level, almost syntactic. more like "here's a new idea everyone is talking out". this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). do you think this is a fair? i'm not sure how worrying this is, but i haven't noticed others mentioning it.

NB: here's 20 random terms I'm imagining included in the dictionary:

Evals
Mechanistic anomaly detection
Stenography
Glitch token
Jailbreaking
RSPs
Model organisms
Trojans
Superposition
Activation engineering
CCS
Singular Learning Theory
Grokking
Constitutional AI
Translucent thoughts
Quantilization
Cyborgism
Factored cognition
Infrabayesianism
Obfuscated arguments

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-30T16:41:35.825Z · LW · GW

I've added a fourth section to my post. It operationalises "innovation" as "non-transient novelty". Some representative examples of an innovation would be:

I think these articles were non-transient and novel.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-30T03:00:56.429Z · LW · GW

(1) Has AI safety slowed down?

There haven’t been any big innovations for 6-12 months. At least, it looks like that to me. I'm not sure how worrying this is, but i haven't noticed others mentioning it. Hoping to get some second opinions.

Here's a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn't we use to get a whole new line-of-attack on the problem every couple months?

By "innovation", I don't mean something normative like "This is impressive" or "This is research I'm glad happened". Rather, I mean something more low-level, almost syntactic, like "Here's a new idea everyone is talking out". This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

Imagine that your job was to maintain a glossary of terms in AI safety.^[1] I feel like you would've been adding new terms quite consistently from 2018-2023, but things have dried up in the last 6-12 months.

(2) When did AI safety innovation peak?

My guess is Spring 2022, during the ELK Prize era. I'm not sure though. What do you guys think?

(3) What’s caused the slow down?

Possible explanations:

ideas are harder to find
people feel less creative
people are more cautious
more publishing in journals
research is now closed-source
we lost the mandate of heaven
the current ideas are adequate
paul christiano stopped posting
i’m mistaken, innovation hasn't stopped
something else

(4) How could we measure "innovation"?

By "innovation" I mean non-transient novelty. An article is "novel" if it uses n-grams that previous articles didn't use, and an article is "transient" if it uses n-grams that subsequent articles didn't use. Hence, an article is non-transient and novel if it introduces a new n-gram which sticks around. For example, Gradient Hacking (Evan Hubinger, October 2019) was an innovative article, because the n-gram "gradient hacking" doesn't appear in older articles, but appears often in subsequent articles. See below.

In Barron et al 2017, they analysed 40 000 parliament speeches during the French Revolution. They introduce a metric "resonance", which is novelty (surprise of article given the past articles) minus transience (surprise of article given the subsequent articles). See below.

My claim is recent AI safety research has been less resonant.

^{^}
Here's 20 random terms that would be in the glossary, to illustrate what I mean:
1. Evals
2. Mechanistic anomaly detection
3. Stenography
4. Glitch token
5. Jailbreaking
6. RSPs
7. Model organisms
8. Trojans
9. Superposition
10. Activation engineering
11. CCS
12. Singular Learning Theory
13. Grokking
14. Constitutional AI
15. Translucent thoughts
16. Quantilization
17. Cyborgism
18. Factored cognition
19. Infrabayesianism
20. Obfuscated arguments

Comment by Cleo Nardo (strawberry calm) on Cryonics is free · 2024-09-29T20:32:21.975Z · LW · GW

I don't understand the s-risk consideration.

Suppose Alice lives naturally for 100 years and is cremated. And suppose Bob lives naturally for 40 years then has his brain frozen for 60 years, and then has his brain cremated. The odds that Bob gets tortured by a spiteful AI should be pretty much exactly the same as for Alice. Basically, its the odds that spiteful AIs appear before 2034.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-28T23:01:22.319Z · LW · GW

Thanks Tamsin! Okay, round 2.

My current understanding of QACI:

We assume a set of hypotheses about the world. We assume the oracle's beliefs are given by a probability distribution $μ \in Δ Ω$ .
We assume sets $Q$ and $A$ of possible queries and answers respectively. Maybe these are exabyte files, i.e. $Q ≅ A ≅ {0, 1}^{N}$ for $N = 2^{60}$ .
Let $Φ$ be the set of mathematical formula that Joe might submit. These formulae are given semantics $eval (ϕ) : Ω \times Q \to Δ A$ for each formula $ϕ \in Φ$ .^[1]
We assume a function $H : Ω \times Q \to Δ Φ$ where $H (α, q) (ϕ) \in [0, 1]$ is the probability that Joe submits formula $ϕ$ after reading query $q$ , under hypothesis $α$ .^[2]
We define $QACI : Ω \times Q \to Δ A$ as follows: sample $ϕ \sim H (α, q)$ , then sample $a \sim eval (ϕ) (α, q)$ , then return $a$ .
For a fixed hypothesis $α$ , we can interpret the answer $a \sim QACI (α, ‘ ‘ Best utility function?")$ as a utility function $u_{α} : Π \to R$ via some semantics $eval-u : A \to (Π \to R)$ .
Then we define $u : Π \to R$ via integrating over $μ$ , i.e. $u (π) := \int u_{α} (π) d μ (α)$ .
A policy $π \in Π$ is optimal if and only if $π^{*} \in {argmax}_{Π} (u)$ .

The hope is that $μ$ , $eval$ , $eval-u$ , and $H$ can be defined mathematically. Then the optimality condition can be defined mathematically.

Question 0

What if there's no policy which maximises $u : Π \to R$ ? That is, for every policy $π$ there is another policy $π^{'}$ such that $u (π^{'}) > u (π)$ . I suppose this is less worrying, but what if there are multiple policies which maximises $u$ ?

Question 1

In Step 7 above, you average all the utility functions together, whereas I suggested sampling a utility function. I think my solution might be safer.

Suppose the oracle puts 5% chance on hypotheses such that $QACI (α, -)$ is malign. I think this is pretty conservative, because Solomonoff predictor is malign, and some of the concerns Evhub raises here. And the QACI amplification might not preserve benignancy. It follows that, under your solution, $u : Π \to R$ is influenced by a coalition of malign agents, and similarly $π^{*} \in argmax (u)$ is influenced by the malign coalition.

By contrast, I suggest sampling $α \sim μ$ and then finding $π^{*} \in {argmax}_{Π} (u_{α})$ . This should give us a benign policy with 95% chance, which is pretty good odds. Is this safer? Not sure.

Question 2

I think the $eval$ function doesn't work, i.e. there won't be a way to mathematically define the semantics of the formula language. In particular, the language $Φ$ must be strictly weaker than the meta-language in which you are hoping to define $eval : Φ \to (Ω \times Q \to Δ A)$ itself. This is because of Tarski's Undefinability of Truth (and other no-go theorems).

This might seem pedantic, but you in practical terms: there's no formula $ϕ$ whose semantics is QACI itself. You can see this via a diagonal proof: imagine if Joe always writes the formal expression $ϕ = ‘ ‘ 1 - QACI (α, q) "$ .

The most elegant solution is probably transfinite induction, but this would give us a QACI for each ordinal.

Question 3

If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions

I want to understand how QACI and prosaic ML map onto each other. As far as I can tell, issues with QACI will be analogous to issues with prosaic ML and vice-versa.

Question 4

I still don't understand why we're using QACI to describe a utility function over policies, rather than using QACI in a more direct approach.

Here's one approach. We pick a policy which maximises $QACI (α, ‘ ‘ How good is policy π ?")$ .^[3] The advantage here is that Joe doesn't need to reason about utility functions over policies, he just need to reason about a single policy in front of him
Here's another approach. We use QACI as our policy directly. That is, in each context $c$ that the agent finds themselves in, they sample an action from $QACI (α, ‘ ‘ What is the best action is context c ?")$ and take the resulting action.^[4] The advantage here is that Joe doesn't need to reason about policies whatsoever, he just needs to reason about a single context in front of him. This is also the most "human-like", because there's no argmax's (except if Joe submits a formula with an argmax).
Here's another approach. In each context $c$ , the agent takes an action $y$ which maximises $QACI (α, ‘ ‘ How good is action y in context c ?")$ .
E.t.c.

Happy to jump on a call if that's easier.

^{^}
I think you would say $eval : Ω \times Q \to A$ . I've added the $Δ$ , which simply amounts to giving Joe access to a random number generator. My remarks apply if $eval : Ω \times Q \to A$ also.
^{^}
I think you would say $H : Ω \times Q \to Φ$ . I've added the $Δ$ , which simply amount to including hypotheses that Joe is stochastic. But my remarks apply if $H : Ω \times Q \to Φ$ also.
^{^}
By this I mean either:
(1) Sample $α \sim μ$ , then maximise the function $π \mapsto QACI (α, ‘ ‘ How good is policy π ?")$ .
(2) Maximise the function $π \mapsto \int QACI (α, ‘ ‘ How good is policy π ?") d μ (α)$ .
For reasons I mentioned in Question 1, I suspect (1) is safer, but (2) is closer to your original approach.
^{^}
I would prefer the agent samples $α \sim μ$ once at the start of deployment, and reuses the same hypothesis $α$ at each time-step. I suspect this is safer than resampling $α$ at each time-step, for reasons discussed before.

Comment by Cleo Nardo (strawberry calm) on On the Role of Proto-Languages · 2024-09-23T20:33:26.032Z · LW · GW

First, proto-languages are not attested. This means that we have no example of writing in any proto-language.

A parent language is typically called "proto-" if the comparative method is our primary evidence about it — i.e. the term is (partially) epistemological metadata.

Proto-Celtic has no direct attestation whatsoever.
Proto-Norse (the parent of Icelandic, Danish, Norwegian, Swedish, etc) is attested, but the written record is pretty scarce, just a few inscriptions.
Proto-Romance (the parent of French, Italian, Spanish, etc) has an extensive written record. More commonly known as "Latin".

I think the existence of Latin as Proto-Romance has an important epistemological upshot:

Let's say we want to estimate how accurately we have reconstructed Proto-Celtic. Well, we can apply the same method used to reconstruct Proto-Celtic to reconstructing Proto-Romance. We can evaluate our reconstruction of Proto-Romance using the written record of Latin. This gives us an estimate of how we would evaluate our Proto-Celtic reconstruction if we discovered a written record tomorrow.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-09-20T22:55:55.071Z · LW · GW

I want to better understand how QACI works, and I'm gonna try Cunningham's Law. @Tamsin Leake.

QACI works roughly like this:

We find a competent honourable human , like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 2048-bit secret key. We define $H^{+}$ as the serial composition of a bajillion copies of $H$ .
We want a model $M$ of the agent $H^{+}$ . In QACI, we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^{+}$ after feeding them a bunch of data about the world and the secret key.
We then ask $M$ the question $q$ , "What's the best reward function to maximise?" to get a reward function $r : (O \times A)^{*} \to R$ . We then train a policy $π : (O \times A)^{*} \times O \to Δ A$ to maximise the reward function $r$ . In QACI, we use some perfect RL algorithm. If we're doing model-free RL, then $π$ might be AIXI (plus some patches). If we're doing model-based RL, then $π$ might be the argmax over expected discounted utility, but I don't know where we'd get the world-model $τ : (O \times A)^{*} \to Δ O$ — maybe we ask $M$ ?

So, what's the connection between the final policy $π$ and the competent honourable human $H$ ? Well overall, $π$ maximises a reward function specified by the ideal reasonser's estimation of the serial composition of a bajillion copies of $H$ . Hmm.

Questions:

Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?
Why not replace Step 1 with Strong HCH or some other amplification scheme?
What does "bajillion" actually mean in Step 1?
Why are we doing Step 3? Wouldn't it be better to just use $M$ directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.
What if there's no reward function that should be maximised? Presumably the reward function would need to be "small", i.e. less than a Exabyte, which imposes a maybe-unsatisfiable constraint.
Why not ask $M$ for the policy $π$ directly? Or some instruction for constructing $π$ ? The instruction could be "Build the policy using our super-duper RL algo with the following reward function..." but it could be anything.
Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign $H$ as $H$ with oracle access to $M$ .
Why isn't Step 3 recursive reward modelling? i.e. we could collect a bunch of trajectories from $π$ and ask $M$ to use those trajectories to improve the reward function.

Comment by Cleo Nardo (strawberry calm) on AI forecasting bots incoming · 2024-09-10T10:55:59.034Z · LW · GW

i’d guess 87.7% is the average over all events x of [ p(x) if resolved yes else 1-p(x) ] where p(x) is the probability the predictor assigns to the event

Comment by Cleo Nardo (strawberry calm) on Gradient Descent on the Human Brain · 2024-09-09T16:30:39.896Z · LW · GW

Fun idea, but idk how this helps as a serious solution to the alignment problem.

suggestion: can you be specific about exactly what “work” the brain-like initialisation is doing in the story?

thoughts:

This risks moral catastrophe. I'm not even sure "let's run gradient descent on your brain upload till your amygdala is playing pong" is something anyone can consent to, because you're creating a new moral patient once you upload and mess with their brain.
How does this address the risks of conventional ML?
1. Let's say we have a reward signal R and we want a model to maximise R during deployment. Conventional ML says "update a model with SGD using R during training" and then hopefully SGD carves into the model R-seeking behaviour. This is risky because, if the model already understands the training process and has some other values, then SGD might carve into the model scheming behaviour. This is because "value R" and "value X and scheme" are both strategies which achieve high R-score during training. But during deployment, the "value X and scheme" model would start a hostile AI takeover.
2. How is this risk mitigated if the NN is initialised to a human brain? The basic deceptive alignment story remains the same.
  1. If the intuition here is "humans are aligned/corrigible/safe/honest etc", then you don't need SGD. Just ask the human to do complete the task, possible with some financial incentive.
  2. If the purpose of SGD is to change the human's values from X to R, then you still risk deceptive alignment. That is, SGD is just as likely to instead change human behaviour from non-scheming to scheming. Both strategies "value R" and "value X and scheme" will perform well during training as judged by R.
"The comparative advantage of this agenda is the strong generalization properties inherent to the human brain. To clarify: these generalization properties are literally as good as they can get, because this tautologically determines what we would want things to generalize as."
1. Why would this be true?
If we have the ability to upload and run human brains, what do we SGD for? SGD is super inefficient, compared with simply teaching a human how to do something. If I remember correctly, if we trained a human-level NN from initialisation using current methods, then the training would correspond to like a million years of human experience. In other words, SGD (from initialisation), would require as much compute as running 1000 brains continuously for 1000 years. But if I had that much compute, I'd probably rather just run the 1000 brains for 1000 years.

That said, I think something in the neighbourhood of this idea could be helpful.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-07-22T20:19:11.639Z · LW · GW

imagine a universe just like this one, except that the AIs are sentient and the humans aren’t — how would you want the humans to treat the AIs in that universe? your actions are correlated with the actions of those humans. acausal decision theory says “treat those nonsentient AIs as you want those nonsentient humans to treat those sentient AIs”.
most of these moral considerations can be defended without appealing to sentience. for example, crediting AIs who deserve credit — this ensures AIs do credit-worthy things. or refraining from stealing an AIs resources — this ensures AIs will trade with you. or keeping your promises to AIs — this ensures that AIs lend you money.
if we encounter alien civilisations, they might think “oh these humans don’t have shmentience (their slightly-different version of sentience) so let’s mistreat them”. this seems bad. let’s not be like that.
many philosophers and scientists don’t think humans are conscious. this is called illusionism. i think this is pretty unlikely, but still >1%. would you accept this offer: I pay you £1 if illusionism is false and murder your entire family if illusionism is true? i wouldn’t, so clearly i care about humans-in-worlds-where-they-arent-conscious. so i should also care about AIs-in-worlds-where-they-arent-conscious.
we don’t understand sentience or consciousness so it seems silly to make it the foundation of our entire morality. consciousness is a confusing concept, maybe an illusion. philosophers and scientists don’t even know what it is.
“don’t lie” and “keep your promises” and “don’t steal” are far less confusing. i know what they means. i can tell whether i’m lying to an AI. by contrast , i don’t know what “don’t cause pain to AIs” means and i can’t tell whether i’m doing it.
consciousness is a very recent concept, so it seems risky to lock in a morality based on that. whereas “keep your promises” and “pay your debts” are principles as old as bones.
i care about these moral considerations as a brute fact. i would prefer a world of pzombies where everyone is treating each other with respect and dignity, over a world of pzombies where everyone was exploiting each other.
many of these moral considerations are part of the morality of fellow humans. i want to coordinate with those humans, so i’ll push their moral considerations.
the moral circle should be as big as possible. what does it mean to say “you’re outside my moral circle”? it doesn’t mean “i will harm/exploit you” because you might harm/exploit people within your moral circle also. rather, it means something much stronger. more like “my actions are in no way influenced by their effect on you”. but zero influence is a high bar to meet.

Comment by Cleo Nardo (strawberry calm) on Shortform · 2024-07-22T17:38:00.398Z · LW · GW

I mean "moral considerations" not "obligations", thanks.
The practice of criminal law exists primarily to determine whether humans deserve punishment. The legislature passes laws, the judges interpret the laws as factual conditions for the defendant deserving punishment, and the jury decides whether those conditions have obtained. This is a very costly, complicated, and error-prone process. However, I think the existing institutions and practices can be adapted for AIs.

User info

Posts

Comments