Posts

Is instrumental convergence a thing for virtue-driven agents? 2025-04-02T03:59:20.064Z

Validating against a misalignment detector is very different to training against one 2025-03-04T15:41:04.692Z

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? 2025-02-24T18:31:48.580Z

Context-dependent consequentialism 2024-11-04T09:29:24.310Z

Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024) 2024-09-01T07:46:26.647Z

Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" 2024-02-29T13:59:34.959Z

mattmacdermott's Shortform 2024-01-03T09:08:14.015Z

What's next for the field of Agent Foundations? 2023-11-30T17:55:13.982Z

Optimisation Measures: Desiderata, Impossibility, Proposals 2023-08-07T15:52:17.624Z

Reward Hacking from a Causal Perspective 2023-07-21T18:27:39.759Z

Incentives from a causal perspective 2023-07-10T17:16:28.373Z

Agency from a causal perspective 2023-06-30T17:37:58.376Z

Introduction to Towards Causal Foundations of Safe AGI 2023-06-12T17:55:24.406Z

Some Summaries of Agent Foundations Work 2023-05-15T16:09:56.364Z

Towards Measures of Optimisation 2023-05-12T15:29:33.325Z

Normative vs Descriptive Models of Agency 2023-02-02T20:28:28.701Z

Comments

Comment by mattmacdermott on rats's Shortform · 2025-04-08T22:38:18.833Z · LW · GW

I think it may or may not diverge from meaningful natural language in the next couple of years, and importantly I think we’ll be able to roughly tell whether it has. So I think we should just see (although finding other formats for interpretable autoregression could be good too).

Comment by mattmacdermott on rats's Shortform · 2025-04-08T18:00:46.792Z · LW · GW

just not do gradient descent on the internal chain of thought, then its just a worse scratchpad.

This seems like a misunderstanding. When OpenAI and others talk about not optimising the chain of thought, they mean not optimising it for looking nice. That still means optimising it for its contribution to the final answer i.e. for being the best scratchpad it can be (that's the whole paradigm).

Comment by mattmacdermott on A Slow Guide to Confronting Doom · 2025-04-07T00:09:43.646Z · LW · GW

If what you mean is you can't be that confident given disagreement, I dunno, I wish I could have that much faith in people.

In another way, being that confident despite disagreement requires faith in people — yourself and the others who agree with you.

I think one reason I have a much lower p(doom) than some people is that although I think the AI safety community is great, I don’t have that much more faith in its epistemics than everyone else’s.

Comment by mattmacdermott on mattmacdermott's Shortform · 2025-04-06T15:48:28.092Z · LW · GW

Circular Consequentialism

When I was a kid I used to love playing RuneScape. One day I had what seemed like a deep insight. Why did I want to kill enemies and complete quests? In order to level up and get better equipment. Why did I want to level up and get better equipment? In order to kill enemies and complete quests. It all seemed a bit empty and circular. I don't think I stopped playing RuneScape after that, but I would think about it every now and again and it would give me pause. In hindsight, my motivations weren't really circular — I was playing Runescape in order to have fun, and the rest was just instrumental to that.

But my question now is — is a true circular consequentialist a stable agent that can exist in the world? An agent that wants X only because it leads to Y, and wants Y only because it leads to X?

Note, I don't think this is the same as a circular preferences situation. The agent isn't swapping X for Y and then Y for X over and over again, ready to get money pumped by some clever observer. It's getting more and more X, and more and more Y over time.

Obviously if it terminally cares about both X and Y or cares about them both instrumentally for some other purpose, a normal consequentialist could display this behaviour. But do you even need terminal goals here? Can you have an agent that only cares about X instrumentally for its effect on Y, and only cares about Y instrumentally for its effect on X? In order for this to be different to just caring about X and Y terminally, I think a necessary property is that the only path through which the agent is trying to increase Y is X, and the only path through which it's trying to increase X is Y.

Comment by mattmacdermott on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-04T16:53:39.924Z · LW · GW

Maybe you could spell this out a bit more? What concretely do you mean when you say that anything that outputs decisions implies a utility function — are you thinking of a certain mathematical result/procedure?

Comment by mattmacdermott on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-03T05:06:37.616Z · LW · GW

anything that outputs decisions implies a utility function

I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
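A minimal sketch of one way to make this concrete, under assumptions that are not in the original comment: a two-state deterministic MDP where each action simply names the next state, and "maximising a utility function over states" is read as maximising the discounted sum of the utilities of visited states. The names `STATES`, `GAMMA`, `policy_value` and `is_optimal` are made up for illustration. Under that reading, the policy that always swaps states is optimal only when the utility is constant.

```python
import itertools

STATES = ("A", "B")
ACTIONS = ("A", "B")  # hypothetical setup: an action names the state it moves to
GAMMA = 0.9           # assumed discount factor


def policy_value(policy, utility, start, horizon=500):
    """Discounted utility of the states visited after `start` under `policy`."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        state = policy[state]
        total += discount * utility[state]
        discount *= GAMMA
    return total


def is_optimal(policy, utility, tol=1e-9):
    """Check `policy` against every deterministic stationary policy, from every start."""
    all_policies = [
        dict(zip(STATES, choice))
        for choice in itertools.product(ACTIONS, repeat=len(STATES))
    ]
    for start in STATES:
        best = max(policy_value(p, utility, start) for p in all_policies)
        if policy_value(policy, utility, start) < best - tol:
            return False
    return True


swap = {"A": "B", "B": "A"}  # always move to the other state
grid = [-1, 0, 0.5, 2]

# Utilities (over a small grid) for which the swap policy is optimal.
rationalising = [
    (ua, ub)
    for ua in grid
    for ub in grid
    if is_optimal(swap, {"A": ua, "B": ub})
]
```

On this grid, `rationalising` contains only the pairs with equal utilities, i.e. the constant utility functions. The grid check is only illustrative; the general argument is that any non-constant utility makes "always go to the better state" strictly better than swapping.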

Comment by mattmacdermott on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-02T15:27:09.237Z · LW · GW

I think this generalises too much from ChatGPT, and also reads too much into ChatGPT's nature from the experiment, but it's a small piece of evidence.

Comment by mattmacdermott on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-02T15:23:20.071Z · LW · GW

I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us.

Later I might try to flesh out my currently-very-loose picture of why consequentialism-in-service-of-virtues seems like a plausible thing we could end up with. I'm not sure whether it implies that you should be able to make a task-based AGI.

Obvious nitpick: It's just "gain as much power as is helpful for achieving whatever my goals are". I think maybe you think instrumental convergence has stronger power-seeking implications than it does. It only has strong implications when the task is very difficult.

Fair enough. Talk of instrumental convergence usually assumes that the amount of power that is helpful will be a lot (otherwise it wouldn't be scary). But I suppose you'd say that's just because we expect to try to use AIs for very difficult tasks. (Later you mention unboundedness too, which I think should be added to difficulty here).

it's likely that some of the tasks will be difficult and unbounded

I'm not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it's on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.

Comment by mattmacdermott on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-02T15:07:54.385Z · LW · GW

I think this gets at the heart of the question (but doesn't consider the other possible answer). Does a powerful virtue-driven agent optimise hard now for its ability to embody that virtue in the future? Or does it just kinda chill and embody the virtue now, sacrificing some of its ability to embody it extra-hard in the future?

I guess both are conceivable, so perhaps I do need to give an argument why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.

Comment by mattmacdermott on LWLW's Shortform · 2025-03-30T19:53:52.615Z · LW · GW

If you have the time look up “Terence Tao” on Gwern’s website.

In case anyone else is going looking, here is the relevant account of Tao as a child and here is a screenshot of the most relevant part:

Comment by mattmacdermott on mattmacdermott's Shortform · 2025-03-24T23:53:24.133Z · LW · GW

Why does ChatGPT voice-to-text keep translating me into Welsh?

I use ChatGPT voice-to-text all the time. About 1% of the time, the message I record in English gets seemingly-roughly-correctly translated into Welsh, and ChatGPT replies in Welsh. Sometimes my messages go back to English on the next message, and sometimes they stay in Welsh for a while. Has anyone else experienced this?

Example: https://chatgpt.com/share/67e1f11e-4624-800a-b9cd-70dee98c6d4e

Comment by mattmacdermott on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-24T23:07:06.359Z · LW · GW

I think the commenter is asking something a bit different - about the distribution of tasks rather than the success rate. My variant of this question: is your set of tasks supposed to be an unbiased sample of the tasks a knowledge worker faces, so that if I see a 50% success rate on 1 hour long tasks I can read it as a 50% success rate on average across all of the tasks any knowledge worker faces?

Or is it more impressive than that because the tasks are selected to be moderately interesting, or less impressive because they’re selected to be measurable, etc.?

Comment by mattmacdermott on Towards a scale-free theory of intelligent agency · 2025-03-21T17:16:14.425Z · LW · GW

I vaguely remember a LessWrong comment from you a couple of years ago saying that you included Agent Foundations in the AGISF course as a compromise despite not thinking it’s a useful research direction.

Could you say something about why you’ve changed your mind, or what the nuance is if you haven’t?

Comment by mattmacdermott on gwern's Shortform · 2025-03-10T14:57:23.633Z · LW · GW

I'd be curious to know if conditioning on high agreement alone had less of this effect than conditioning on high karma alone (because something many people agree on is unlikely to be a claim of novel evidence, and more likely to be a take).

Comment by mattmacdermott on Validating against a misalignment detector is very different to training against one · 2025-03-07T19:03:36.741Z · LW · GW

Flagging for posterity that we had a long discussion about this via another medium and I was not convinced.

Comment by mattmacdermott on groblegark's Shortform · 2025-03-07T09:23:12.788Z · LW · GW

More or less, yes. But I don't think it suggests there might be other prompts around that unlock similar improvements -- chain-of-thought works because it allows the model to spend more serial compute on a problem, rather than because of something really important about the words.

Comment by mattmacdermott on On OpenAI’s Safety and Alignment Philosophy · 2025-03-06T22:08:57.239Z · LW · GW

Agree that pauses are a clearer line. But even if a pause and tool-limit are both temporary, we should expect the full pause to have to last longer.

Comment by mattmacdermott on On OpenAI’s Safety and Alignment Philosophy · 2025-03-06T18:07:48.285Z · LW · GW

One difference is that keeping AI a tool might be a temporary strategy until you can use the tool AI to solve whatever safety problems apply to non-tool AI. In that case the co-ordination problem isn't as difficult because you might just need to get the smallish pool of leading actors to co-ordinate for a while, rather than everyone to coordinate indefinitely.

Comment by mattmacdermott on What Is The Alignment Problem? · 2025-03-06T12:51:58.429Z · LW · GW

I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximated Solomonoff induction.

Even granting that, do you think the same applies to the cognition of an AI created using deep learning -- is it approximating Solomonoff induction when presented with a new problem at inference time?

I think it's not, for reasons like the ones in aysja's comment.

Comment by mattmacdermott on Validating against a misalignment detector is very different to training against one · 2025-03-04T17:00:49.306Z · LW · GW

Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.

Comment by mattmacdermott on How might we safely pass the buck to AI? · 2025-03-04T15:48:29.681Z · LW · GW

I wrote it out as a post here.

Comment by mattmacdermott on What goals will AIs have? A list of hypotheses · 2025-03-04T08:36:06.766Z · LW · GW

I think it's downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we've made progress eliminating hypotheses from this list.

Fair enough, yeah -- this seems like a very reasonable angle of attack.

Comment by mattmacdermott on What goals will AIs have? A list of hypotheses · 2025-03-03T22:39:39.469Z · LW · GW

It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.

As written, aren't Hypothesis 1: Written goal specification, Hypothesis 2: Developer-intended goals, and Hypothesis 3: Unintended version of written goals and/or human intentions all compatible with either kind of AI?

Hypothesis 4: Reward/reinforcement does assume a consequentialist, and so does Hypothesis 5: Proxies and/or instrumentally convergent goals as written, although it seems like 'proxy virtues' could maybe be a thing too?

(Unrelatedly, it's not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I'm missing something).

Maybe I shouldn't have used "Goals" as the term of art for this post, but rather "Traits?" or "Principles?" Or "Virtues."

I probably wouldn't prefer any of those to goals. I might use "Motivations", but I also think it's ok to use goals in this broader way and "consequentialist goals" when you want to make the distinction.

Comment by mattmacdermott on What goals will AIs have? A list of hypotheses · 2025-03-03T21:23:13.762Z · LW · GW

One thing that might be missing from this analysis is explicitly thinking about whether the AI is likely to be driven by consequentialist goals.

In this post you use 'goals' in quite a broad way, so as to include stuff like virtues (e.g. "always be honest"). But we might want to carefully distinguish scenarios in which the AI is primarily motivated by consequentialist goals from ones where it's motivated primarily by things like virtues, habits, or rules.

This would be the most important axis to hypothesise about if it was the case that instrumental convergence applies to consequentialist goals but not to things like virtues. Like, I think it's plausible that

(i) if you get an AI with a slightly wrong consequentialist goal (e.g. "maximise everyone's schmellbeing") then you get paperclipped because of instrumental convergence,

(ii) if you get an AI that tries to embody slightly the wrong virtue (e.g. "always be schmonest") then it's badly dysfunctional but doesn't directly entail a disaster.

And if that's correct, then we should care about the question "Will the AI's goals be consequentialist ones?" more than most questions about them.

Comment by mattmacdermott on Daniel Kokotajlo's Shortform · 2025-02-28T17:34:06.792Z · LW · GW

You know you’re feeling the AGI when a compelling answer to “What’s the best argument for very short AI timelines?” lengthens your timelines

Comment by mattmacdermott on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-28T09:49:32.633Z · LW · GW

Interesting. My handwavey rationalisation for this would be something like:

- there's some circuitry in the model which is responsible for checking whether a trigger is present and activating the triggered behaviour
- for simple triggers, the circuitry is very inactive in the absence of the trigger, so it's unaffected by normal training
- for complex triggers, the circuitry is much more active by default, because it has to do more work to evaluate whether the trigger is present, so it's more affected by normal training

Comment by mattmacdermott on Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? · 2025-02-27T10:26:43.391Z · LW · GW

I agree that it can be possible to turn such a system into an agent. I think the original comment is defending a stronger claim that there's a sort of no free lunch theorem: either you don't act on the outputs of the oracle at all, or it's just as much of an agent as any other system.

I think the stronger claim is clearly not true. The worrying thing about powerful agents is that their outputs are selected to cause certain outcomes, even if you try to prevent those outcomes. So depending on the actions you're going to take in response to its outputs, its outputs have to be different. But the point of an oracle is to not have that property -- its outputs are decided by a criterion (something like truth) -- that is independent of the actions you're going to take in response^[1]. So if you respond differently to the outputs, they cause different outcomes. Assuming you've succeeded at building the oracle to specification, it's clearly not the case that the oracle has the worrying property of agents just because you act on its outputs.

I don't disagree that by either hooking the oracle up in a scaffolded feedback loop with the environment, or getting it to output plans, you could extract more agency from it. Of the two I think the scaffolding can in principle easily produce dangerous agency in the same way long-horizon RL can, but that the version where you get it to output a plan is much less worrying (I can argue for that in a separate comment if you like).

I'm ignoring the self-fulfilling prophecy case here. ↩︎

Comment by mattmacdermott on The non-tribal tribes · 2025-02-26T19:19:16.287Z · LW · GW

“It seems silly to choose your values and behaviors and preferences just because they’re arbitrarily connected to your social group.”

If you think this way, then you’re already on the outside.

I don’t think this is true — your average person would agree with the quote (if asked) and deny that it applies to them.

Comment by mattmacdermott on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-26T11:40:34.460Z · LW · GW

Finetuning generalises a lot but not to removing backdoors?

Comment by mattmacdermott on Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? · 2025-02-25T22:10:00.654Z · LW · GW

Seems like we don’t really disagree

Comment by mattmacdermott on Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? · 2025-02-25T16:17:04.334Z · LW · GW

The arguments in the paper are representative of Yoshua's views rather than mine, so I won't directly argue for them, but I'll give my own version of the case against

the distinctions drawn here between RL and the science AI all break down at high levels.

It seems commonsense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer time-horizon those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you're training and the outcomes being achieved.

At the top of the spectrum, you have systems trained based on things like the stock price of a company, taking many actions and receiving many observations per second, over years-long trajectories.

Many steps down from that you have RL training of current LLMs: outcome-based, but with shorter trajectories which are less tightly coupled with the outside world.

And at the bottom of the spectrum you have systems which are trained with an objective that depends directly on their outputs and not on the outcomes they cause, with the feedback not being propagated across time very far at all.

At the top of the spectrum, if you train a competent system it seems almost guaranteed that it's a powerful agent. It's a machine for pushing the world into certain configurations. But at the bottom of the spectrum it seems much less likely -- its input-output behaviour wasn't selected to be effective at causing certain outcomes.

Yes there are still ways you could create an agent through a training setup at the bottom of the spectrum (e.g. supervised learning on the outputs of a system at the top of the spectrum), but I don't think they're representative. And yes depending on what kind of a system it is you might be able to turn it into an agent using a bit of scaffolding, but if you have the choice not to, that's an importantly different situation compared to the top of the spectrum.

And yes, it seems possible such setups lead to an agentic shoggoth completely by accident -- we don't understand enough to rule that out. But I don't see how you end up judging the probability that we get a highly agentic system to be more or less the same wherever we are on the spectrum (if you do)? Or perhaps it's just that you think the distinction is not being handled carefully in the paper?

Comment by mattmacdermott on Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? · 2025-02-25T08:15:49.821Z · LW · GW

Pre-training, finetuning and RL are all types of training. But sure, expand 'train' to 'create' in order to include anything else like scaffolding. The point is it's not what you do in response to the outputs of the system, it's what the system tries to do.

Comment by mattmacdermott on Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? · 2025-02-24T21:14:53.961Z · LW · GW

Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?

(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)

Comment by mattmacdermott on How might we safely pass the buck to AI? · 2025-02-22T08:56:06.315Z · LW · GW

I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.

You point out an important difference which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it’s clearly much better to use some things for validation than training.

Like, I think there are things that are easy to train away but hard/slow to validate away (just like when training an image classifier you could in principle memorise the validation set, but it would take a ridiculous amount of hyperparameter optimisation).

One example might be if we have interp methods that measure correlates of scheming. Incredibly easy to train away, still possible to validate away but probably enough harder that the ratio of non-schemers you get is higher than if you trained against it, which wouldn’t affect the ratio at all.

A separate argument is that I think if you just do random search over training ideas, rejecting if they don’t get a certain validation score, you actually don’t goodhart at all. Might put that argument in a top level post.
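A rough sketch of the ratio claim above, with every number invented for illustration: suppose each training idea independently yields a schemer with probability `P_SCHEMER`, and a detector flags schemers and non-schemers at the assumed rates `DETECT_SCHEMER` and `FALSE_ALARM`. "Training against" the detector is modelled in its most pessimistic form, where the detector is always silenced and scheming is unchanged. Validation is modelled as rejecting any idea whose model gets flagged.

```python
import random

# All three rates are made-up assumptions, not measurements.
P_SCHEMER = 0.8       # chance a random training idea yields a schemer
DETECT_SCHEMER = 0.9  # chance the detector flags a schemer
FALSE_ALARM = 0.1     # chance the detector flags a non-schemer


def non_schemer_rate_after_validation():
    """Bayes' rule: fraction of non-schemers among models that pass the detector."""
    pass_good = (1 - P_SCHEMER) * (1 - FALSE_ALARM)
    pass_bad = P_SCHEMER * (1 - DETECT_SCHEMER)
    return pass_good / (pass_good + pass_bad)


def non_schemer_rate_after_training_against():
    """Assumed worst case: training silences the detector but leaves scheming as is."""
    return 1 - P_SCHEMER


def simulate_validation(n_ideas=20000, seed=0):
    """Random search over ideas, keeping only those whose model is not flagged."""
    rng = random.Random(seed)
    accepted = good = 0
    for _ in range(n_ideas):
        schemer = rng.random() < P_SCHEMER
        flagged = rng.random() < (DETECT_SCHEMER if schemer else FALSE_ALARM)
        if not flagged:
            accepted += 1
            good += not schemer
    return good / accepted
```

With these assumed rates, validation lifts the non-schemer fraction from 0.2 to roughly 0.69, while the pessimistic training-against model leaves it at 0.2. The gap depends entirely on the detector rates chosen here.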

Comment by mattmacdermott on How might we safely pass the buck to AI? · 2025-02-20T20:11:13.145Z · LW · GW

Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it.

I’m not necessarily asserting that this solves the problem, but seems important to note that the obviously-superior alternative to training against it is validating against it. i.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.

However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.

EDIT: later in the thread you say this "is in some sense approximately the only and central core of the alignment problem". I'm wondering whether thinking about this validation vs training point might cause you a nontrivial update then?

Comment by mattmacdermott on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-24T21:00:01.335Z · LW · GW

For some reason I've been muttering the phrase, "instrumental goals all the way up" to myself for about a year, so I'm glad somebody's come up with an idea to attach it to.

Comment by mattmacdermott on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-24T20:54:44.040Z · LW · GW

One time I was camping in the woods with some friends. We were sat around the fire in the middle of the night, listening to the sound of the woods, when one of my friends got out a bluetooth speaker and started playing donk at full volume (donk is a kind of funny, somewhat obnoxious style of dance music).

I strongly felt that this was a bad bad bad thing to be doing, and was basically pleading with my friend to turn it off. Everyone else thought it was funny and that I was being a bit dramatic -- there was nobody around for hundreds of metres, so we weren't disturbing anyone.

I think my friends felt that because we were away from people, we weren't "stepping on the toes of any instrumentally convergent subgoals" with our noise pollution. Whereas I had the vague feeling that we were disturbing all these squirrels and pigeons or whatever that were probably sleeping in the trees, so we were "stepping on the toes of instrumentally convergent subgoals" to an awful degree.

Which is all to say, for happy instrumental convergence to be good news for other agents in your vicinity, it seems like you probably do still need to care about those agents for some reason?

Comment by mattmacdermott on MONA: Managed Myopia with Approval Feedback · 2025-01-24T17:30:17.749Z · LW · GW

I'd like to do some experiments using your loan application setting. Is it possible to share the dataset?

Comment by mattmacdermott on mattmacdermott's Shortform · 2025-01-24T13:28:32.216Z · LW · GW

But - what might the model that AGI uses to downright visibility and serve up ideas look like?

What I was meaning to get at is that your brain is an AGI that does this for you automatically.

Comment by mattmacdermott on TsviBT's Shortform · 2025-01-23T10:45:26.673Z · LW · GW

Fine, but it still seems like a reason one could give for death being net good (which is your chief criterion for being a deathist).

I do think it's a weaker reason than the second one. The following argument in defence of it is mainly for fun:

I slightly have the feeling that it's like that decision theory problem where the devil offers you pieces of a poisoned apple one by one. First half, then a quarter, then an eighth, then a sixteenth... You'll be fine unless you eat the whole apple, in which case you'll be poisoned. Each time you're offered a piece it's rational to take it, but following that policy means you get poisoned.

The analogy is that I consider living for eternity to be scary, and you say, "well, you can stop any time". True, but it's always going to be rational for me to live for one more year, and that way lies eternity.

Comment by mattmacdermott on TsviBT's Shortform · 2025-01-23T09:14:00.053Z · LW · GW

Other (more compelling to me) reasons for being a "deathist":

Eternity can seem kinda terrifying.
In particular, death is insurance against the worst outcomes lasting forever. Things will always return to neutral eventually and stay there.

Comment by mattmacdermott on mattmacdermott's Shortform · 2025-01-22T17:35:23.749Z · LW · GW

Your brain is for holding ideas, not for having them

Notes systems are nice for storing ideas but they tend to get clogged up with stuff you don't need, and you might never see the stuff you do need again. Wouldn't it be better if

ideas got automatically downweighted in visibility over time according to their importance, as judged by an AGI who is intimately familiar with every aspect of your life
ideas got automatically served up to you at relevant moments as judged by that AGI.

Your brain is that notes system. On the other hand, writing notes is a great way to come up with new ideas.

Comment by mattmacdermott on The Case Against AI Control Research · 2025-01-22T13:44:27.743Z · LW · GW

and nobody else ever seems to do anything useful as a result of such fights

I would guess a large fraction of the potential value of debating these things comes from its impact on people who aren’t the main proponents of the research program, but are observers deciding on their own direction.

Is that priced in to the feeling that the debates don’t lead anywhere useful?

Comment by mattmacdermott on quetzal_rainbow's Shortform · 2025-01-19T11:03:06.803Z · LW · GW

The notion of ‘fairness’ discussed in e.g. the FDT paper is something like: it’s fair to respond to your policy, i.e. what you would do in any counterfactual situation, but it’s not fair to respond to the way that policy is decided.

I think the hope is that you might get a result like “for all fair decision problems, decision-making procedure A is better than decision-making procedure B by some criterion to do with the outcomes it leads to”.

Without the fairness assumption you could create an instant counterexample to any such result by writing down a decision problem where decision-making procedure A is explicitly penalised e.g. omega checks if you use A and gives you minus a million points if so.
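As a toy illustration of the distinction (a sketch, not taken from the FDT paper): below, a hypothetical `Procedure` has a `name` and a `decide` function. The "fair" problem is a Newcomb-style payoff that depends only on what the procedure chooses, assuming a perfect predictor. The "unfair" problem additionally inspects the procedure's name and penalises procedure A. All names and payoffs are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Procedure:
    name: str                   # how the decision is made (hypothetical label)
    decide: Callable[[], str]   # what it outputs: "one-box" or "two-box"


def fair_problem(procedure: Procedure) -> int:
    """Newcomb-style payoff that depends only on the choice (perfect predictor assumed)."""
    choice = procedure.decide()
    opaque_box = 1_000_000 if choice == "one-box" else 0
    transparent_box = 1_000 if choice == "two-box" else 0
    return opaque_box + transparent_box


def unfair_problem(procedure: Procedure) -> int:
    """Same payoff, but omega also checks which procedure produced the choice."""
    payoff = fair_problem(procedure)
    if procedure.name == "A":
        payoff -= 1_000_000  # explicit penalty for using procedure A
    return payoff


proc_a = Procedure(name="A", decide=lambda: "one-box")
proc_b = Procedure(name="B", decide=lambda: "one-box")
```

Here `proc_a` and `proc_b` have the same policy, so the fair problem pays them identically, while the unfair one separates them purely by name. That is the kind of instant counterexample the fairness assumption rules out.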

Comment by mattmacdermott on gwern's Shortform · 2025-01-17T09:25:46.116Z · LW · GW

a Bayesian interpretation where you don't need to renormalize after every likelihood computation

How does this differ from using Bayes' rule in odds ratio form? In that case you only ever have to renormalise if at some point you want to convert to probabilities.
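For reference, a small sketch of what the odds form looks like for a binary hypothesis. The function names and the example likelihood ratios are illustrative assumptions. Posterior odds are the prior odds times the product of the likelihood ratios, and normalisation happens only once, at the final conversion to a probability.

```python
def update_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by each likelihood ratio P(e | H) / P(e | not H)."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds


def odds_to_probability(odds):
    """The only normalisation step: convert odds back to a probability."""
    return odds / (1 + odds)


def update_probability(prior, likelihood_ratios):
    """Equivalent update that renormalises after every piece of evidence."""
    p = prior
    for ratio in likelihood_ratios:
        p = p * ratio / (p * ratio + (1 - p))
    return p


ratios = [2, 3, 0.5]  # made-up evidence
via_odds = odds_to_probability(update_odds(1.0, ratios))
via_probs = update_probability(0.5, ratios)
```

Both routes give a posterior of 0.75 for these ratios; the odds route just defers the division to the end.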

Comment by mattmacdermott on Implications of the inference scaling paradigm for AI safety · 2025-01-14T11:33:06.463Z · LW · GW

I think all of the following:

process-based feedback pushes against faithfulness because it incentivises having a certain kind of CoT independently of the outcome
outcome-based feedback pushes towards faithfulness because it incentivises making use of earlier tokens to get the right answer
outcome-based feedback pushes against legibility because it incentivises the model to discover new strategies that we might not know it's using
combining process-based feedback with outcome-based feedback:
- pushes extra hard against legibility because it incentivises obfuscating said strategies^[1]
- unclear sign wrt faithfulness

or at least has the potential to, depending on the details. ↩︎

Comment by mattmacdermott on Tips On Empirical Research Slides · 2025-01-13T08:19:47.966Z · LW · GW

I found this really useful, thanks! I especially appreciate details like how much time you spent on slides at first, and how much you do now.

Comment by mattmacdermott on Selfmaker662's Shortform · 2024-12-31T11:12:19.532Z · LW · GW

Relevant keyword: I think the term for interactions like this where players have an incentive to misreport their preferences in order to bring about their desired outcome is “not strategyproof”.
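A standard toy example of the term, unrelated to the specific interaction discussed in the thread: three agents each report a number and a rule picks the outcome. The reports and ideal points below are invented. Taking the mean is not strategyproof, since an agent can drag the outcome towards its ideal by misreporting, whereas taking the median is strategyproof when preferences are single-peaked.

```python
from statistics import mean, median

ideal_points = [0, 4, 10]   # each agent's true preferred outcome (assumed)
truthful = [0, 4, 10]       # everyone reports honestly
misreport = [0, 2, 10]      # the middle agent reports 2 instead of 4

# Mean rule: the middle agent does better by lying.
honest_gap = abs(mean(truthful) - ideal_points[1])   # mean is 14/3, gap is 2/3
lying_gap = abs(mean(misreport) - ideal_points[1])   # mean is exactly 4, gap is 0

# Median rule: the agent with ideal point 10 cannot gain from any report in 0..10.
honest_median_gap = abs(median(truthful) - ideal_points[2])
best_median_gap = min(
    abs(median([0, 4, report]) - ideal_points[2]) for report in range(11)
)
```

Under the mean rule `lying_gap` is smaller than `honest_gap`, so honesty is not a best response. Under the median rule the best achievable gap equals the honest one.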

Comment by mattmacdermott on o3 · 2024-12-21T11:58:38.436Z · LW · GW

The coin coming up heads is “more headsy” than the expected outcome, but maybe o3 is about as headsy as Thane expected.

Like if you had thrown 100 coins and then revealed that 80 were heads.

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-12-20T08:48:25.962Z · LW · GW

Could mech interp ever be as good as chain of thought?

Suppose there is 10 years of monumental progress in mechanistic interpretability. We can roughly - not exactly - explain any output that comes out of a neural network. We can do experiments where we put our AIs in interesting situations and make a very good guess at their hidden reasons for doing the things they do.

Doesn't this sound a bit like where we currently are with models that operate with a hidden chain of thought? If you don't think that an AGI built with the current fingers-crossed-it's-faithful paradigm would be safe, what percentage an outcome would mech interp have to hit to beat that?

Seems like 99+ to me.
