Posts

Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" 2024-02-29T13:59:34.959Z
mattmacdermott's Shortform 2024-01-03T09:08:14.015Z
What's next for the field of Agent Foundations? 2023-11-30T17:55:13.982Z
Optimisation Measures: Desiderata, Impossibility, Proposals 2023-08-07T15:52:17.624Z
Reward Hacking from a Causal Perspective 2023-07-21T18:27:39.759Z
Incentives from a causal perspective 2023-07-10T17:16:28.373Z
Agency from a causal perspective 2023-06-30T17:37:58.376Z
Introduction to Towards Causal Foundations of Safe AGI 2023-06-12T17:55:24.406Z
Some Summaries of Agent Foundations Work 2023-05-15T16:09:56.364Z
Towards Measures of Optimisation 2023-05-12T15:29:33.325Z
Normative vs Descriptive Models of Agency 2023-02-02T20:28:28.701Z

Comments

Comment by mattmacdermott on Examples of Highly Counterfactual Discoveries? · 2024-04-26T21:16:47.551Z · LW · GW

Lucius-Alexander SLT dialogue?

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-03-04T10:06:51.017Z · LW · GW

Yep, exactly.

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-03-03T14:39:28.592Z · LW · GW

Neural network interpretability feels like it should be called neural network interpretation.

Comment by mattmacdermott on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-02T23:09:47.736Z · LW · GW

If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.

I think the reason Bayesian ML isn't that widely used is that it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.

Comment by mattmacdermott on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-02T00:17:13.924Z · LW · GW

I agree this agenda wants to solve ELK by making legible predictive models.

But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.
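A minimal sketch of the conservative-acceptance idea (my own illustration with hypothetical names, not Bengio's actual setup):

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class HarmHypothesis:
    posterior_prob: float                    # weight of this hypothesis under the learned posterior
    predicts_harm: Callable[[Any], bool]     # whether this hypothesis flags the plan as harmful

def plan_is_acceptable(plan: Any,
                       hypotheses: List[HarmHypothesis],
                       attention_threshold: float) -> bool:
    """Reject the plan if any hypothesis whose posterior probability exceeds the
    (context-dependent) threshold predicts a harmful outcome."""
    return not any(
        h.posterior_prob >= attention_threshold and h.predicts_harm(plan)
        for h in hypotheses
    )
```

As the comment above notes, the threshold at which you start paying attention to a hypothesis becomes an explicit, context-dependent parameter, which an ordinary ensemble doesn't give you in such a principled way.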

Comment by mattmacdermott on Shortform · 2024-03-01T19:03:18.670Z · LW · GW

It's not just a lesswrong thing (wikipedia).

My feeling is that (like most jargon) it's to avoid ambiguity arising from the fact that "commitment" has multiple meanings. When I google commitment I get the following two definitions:

  1. the state or quality of being dedicated to a cause, activity, etc.
  2. an engagement or obligation that restricts freedom of action

Precommitment is a synonym for the second meaning, but not the first. When you say, "the agent commits to 1-boxing," there's no ambiguity as to which type of commitment you mean, so the 'pre-' seems redundant. But if you were to say, "commitment can get agents more utility," it might sound like you were saying, "dedication can get agents more utility," which is also true.

Comment by mattmacdermott on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-01T12:01:34.049Z · LW · GW

tutorial

Comment by mattmacdermott on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-01T11:34:51.433Z · LW · GW

IIUC, I think that in addition to making predictive models more human interpretable, there's another way this agenda aspires to get around the ELK problem.

Rather than having a range of predictive models/hypotheses but just a single notion of what constitutes a bad outcome, it wants to also learn a posterior over hypotheses for what constitutes a bad outcome, and then act conservatively with respect to that.

IIRC the ELK report is about trying to learn an auxiliary model which reports on whether the predictive model predicts a bad outcome, and we want it to learn the "right" notion of what constitutes a bad outcome, rather than a wrong one like "a human's best guess would be that this outcome is bad". I think Bengio's proposal aspires to learn a range of plausible auxiliary models, and sufficiently cover that space that both the "right" notion and the wrong one are in there, and then if any of those models predict a bad outcome, call it a bad plan.

EDIT: from a quick look at the ELK report, this idea ("ensembling") is mentioned under "How we'd approach ELK in practice". Doing ensembles well seems like sort of the whole point of the AI scientists idea, so it's plausible to me that this agenda could make progress on ELK even if it wasn't specifically thinking about the problem.

Comment by mattmacdermott on Benito's Shortform Feed · 2024-02-27T15:29:59.299Z · LW · GW

"Bear in mind he could be wrong" works well for telling somebody else to track a hypothesis.

"I'm bearing in mind he could be wrong" is slightly clunkier but works ok.

Comment by mattmacdermott on A Bird's Eye View of the ML Field [Pragmatic AI Safety #2] · 2024-02-23T10:59:21.126Z · LW · GW

Really great post. The pictures are all broken for me, though.

Comment by mattmacdermott on Dreams of AI alignment: The danger of suggestive names · 2024-02-13T09:21:26.000Z · LW · GW

He means the second one.

Seems true in the extreme (if you have no idea what something is, how can you reasonably be worried about it?), but less strange the further you get from that.

Comment by mattmacdermott on Dreams of AI alignment: The danger of suggestive names · 2024-02-12T19:14:10.903Z · LW · GW

Somewhat related: how do we not have separate words for these two meanings of 'maximise'?

  1. literally set something to its maximum value
  2. try to set it to a big value, the bigger the better

Even what I've written for (2) doesn't feel like it unambiguously captures the generally understood meaning of 'maximise' in common phrases like 'RL algorithms maximise reward' or 'I'm trying to maximise my income'. I think the really precise version would be 'try to affect something, having a preference ordering over outcomes which is monotonic in their size'.

But surely this concept deserves a single word. Does anyone know a good word for this, or feel like coining one?

Comment by mattmacdermott on And All the Shoggoths Merely Players · 2024-02-12T17:46:20.271Z · LW · GW

you can train on MNIST digits with twenty wrong labels for every correct one and still get good performance as long as the correct label is slightly more common than the most common wrong label

I know some pigeons who would question this claim

Comment by mattmacdermott on Dreams of AI alignment: The danger of suggestive names · 2024-02-11T17:24:31.134Z · LW · GW

I have read many of your posts on these topics, appreciate them, and I get value from the model of you in my head that periodically checks for these sorts of reasoning mistakes.

But I worry that the focus on 'bad terminology' rather than reasoning mistakes themselves is misguided.

To choose the most clear cut example, I'm quite confident that when I say 'expectation' I mean 'weighted average over a probability distribution' and not 'anticipation of an inner consciousness'. Perhaps some people conflate the two, in which case it's useful to disabuse them of the confusion, but I really would not like it to become the case that every time I said 'expectation' I had to add a caveat to prove I know the difference, lest I get 'corrected' or sneered at.

For a probably more contentious example, I'm also reasonably confident that when I use the phrase 'the purpose of RL is to maximise reward', the thing I mean by it is something you wouldn't object to, and which does not cause me confusion. And I think those words are a straightforward way to say the thing I mean. I agree that some people have mistaken heuristics for thinking about RL, but I doubt you would disagree very strongly with mine, and yet if I was to talk to you about RL I feel I would be walking on eggshells trying to use long-winded language in such a way as to not get me marked down as one of 'those idiots'.

I wonder if it's better, as a general rule, to focus on policing arguments rather than language? If somebody uses terminology you dislike to generate a flawed reasoning step and arrive at a wrong conclusion, then you should be able to demonstrate the mistake by unpacking the terminology into your preferred version, and it's a fair cop.

But until you've seen them use it to reason poorly, perhaps it's a good norm to assume they're not confused about things, even if the terminology feels like it has misleading connotations to you.

Comment by mattmacdermott on Choosing a book on causality · 2024-02-07T21:23:54.641Z · LW · GW

Causal Inference in Statistics (pdf) is much shorter, and a pretty easy read.

I have not read Causality, but I think you should probably read the primer first and then decide if you need to read that too.

Comment by mattmacdermott on Leading The Parade · 2024-02-01T20:26:14.498Z · LW · GW

Sorry, yeah, my comment was quite ambiguous.

I meant that while gaining status might be a questionable first step in a plan to have impact, gaining skill is pretty much an essential one, and in particular getting an ML PhD or working at a big lab seem like quite solid plans for gaining skill.

i.e. if you replace status with skill I agree with the quotes instead of John.

Comment by mattmacdermott on Leading The Parade · 2024-02-01T16:39:14.899Z · LW · GW

People occasionally come up with plans like "I'll lead the parade for a while, thereby accumulating high status. Then, I'll use that high status to counterfactually influence things!". This is one subcategory of a more general class of plans: "I'll chase status for a while, then use that status to counterfactually influence things!". Various versions of this often come from EAs who are planning to get a machine learning PhD or work at one of the big three AI labs.

This but skill instead of status?

Comment by mattmacdermott on A model of research skill · 2024-01-25T11:20:40.010Z · LW · GW

Any other biography suggestions?

Comment by mattmacdermott on 1a3orn's Shortform · 2024-01-18T08:29:54.691Z · LW · GW

I think Just Don't Build Agents could be a win-win here. All the fun of AGI without the washing up, if it's enforceable.

Possible ways to enforce it:

(1) Galaxy-brained AI methods like Davidad's night watchman. Downside: scary, hard.

(2) Ordinary human methods like requiring all large training runs to be approved by the No Agents committee.

Downside: we'd have to ban not just training agents, but training any system that could plausibly be used to build an agent, which might well include oracle-ish AI like LLMs. Possibly something like Bengio's scientist AI might be allowed.

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-01-18T08:09:51.975Z · LW · GW

LW feature I would like: I click a button on a sequence and receive one post in my email inbox per day.

Comment by mattmacdermott on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-14T19:44:17.225Z · LW · GW

I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout's words, AGI threat scenario "window dressing," or when players from an EA-coded group research a topic.

At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas.

Since a big proportion will always be newcomers, this means the community will under- or overweight various areas, but I'm not sure that newcomers dropping the heuristic would lead to better results.

Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-01-03T13:44:44.219Z · LW · GW

Perhaps I don't understand it, but this seems quite far-fetched to me and I'd be happy to trade in what I see as much more compelling alignment concerns about agents for concerns like this.

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-01-03T11:59:53.834Z · LW · GW

Can’t we avoid this just by being careful about credit assignment?

If we read off a prediction, take some actions in the world, then compute the gradients based on whether the prediction came true, we incentivise self-fulfilling prophecies.

If we never look at predictions which we’re going to use as training data before they resolve, then we don’t.

This is the core of the counterfactual oracles idea: just don’t let model output causally influence training labels.
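A minimal sketch of that discipline (hypothetical function names, just to make the credit-assignment point concrete):

```python
from typing import Any, Callable, List, Tuple

def collect_oracle_training_data(
    predict: Callable[[Any], float],
    resolve_outcome: Callable[[Any], float],
    questions: List[Any],
) -> List[Tuple[Any, float]]:
    """Gather (question, label) pairs where nobody reads the prediction before the
    outcome resolves, so the prediction cannot causally influence its own label."""
    data = []
    for q in questions:
        _sealed_prediction = predict(q)   # computed and logged, but not shown to anyone yet
        label = resolve_outcome(q)        # the world plays out without acting on the prediction
        data.append((q, label))           # gradients later depend only on the resolved label
    return data
```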

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-01-03T11:19:29.106Z · LW · GW

I think the most obvious route to building an oracle is to combine a massive self-supervised predictive model with a question-answering head.

What's still difficult here is getting a training signal that incentivises truthfulness rather than sycophancy, which I think is what ARC's ELK stuff wants (wanted?) to address. Really good mechinterp, new inherently interpretable architectures, or inductive bias-jitsu are other potential approaches.

But the other difficult aspects of the alignment problem (avoiding deceptive alignment, goalcraft) seem to just go away when you drop the agency.

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-01-03T10:31:46.941Z · LW · GW

AIs limited to pure computation (Tool AIs) supporting humans, will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) who act on their own and meta-learn, because all problems are reinforcement-learning problems.

Isn't this a central example of "somebody else will surely build agentic AI"?

I guess it argues “building safe non-agentic AI before somebody else builds agentic AI is difficult” because agents have a capability advantage.

This may well be true (but also perhaps not, because e.g. agents might have capability disadvantages from misalignment, or because reinforcement learning is just harder than other forms of ML).

But either way I think it has importantly different strategy implications to “it seems difficult to make non-agentic AI safe”.

Comment by mattmacdermott on mattmacdermott's Shortform · 2024-01-03T09:08:14.193Z · LW · GW

I feel like there's a bit of a motte and bailey in AI risk discussion, where the bailey is "building safe non-agentic AI is difficult" and the motte is "somebody else will surely build agentic AI".

Are there any really compelling arguments for the bailey? If not then I think "build an oracle and ask it how to avoid risk from other people building agents" is an excellent alignment plan.

Comment by mattmacdermott on Nathan Young's Shortform · 2024-01-02T12:16:05.257Z · LW · GW

I might have misunderstood you, but I wonder if you're mixing up calculating the self-information or surprisal of an outcome with the information gain on updating your beliefs from one distribution to another.

An outcome which has probability 50% contains −log₂(0.5) = 1 bit of self-information, and an outcome which has probability 75% contains −log₂(0.75) ≈ 0.415 bits, which seems to be what you've calculated.

But since you're talking about the bits of information between two probabilities I think the situation you have in mind is that I've started with 50% credence in some proposition A, and ended up with 25% (or 75%). To calculate the information gained here, we need to find the entropy of our initial belief distribution, and subtract the entropy of our final beliefs. The entropy of our beliefs about A is H(p) = −p log₂(p) − (1−p) log₂(1−p).

So for 50% → 25% it's H(0.5) − H(0.25) = 1 − 0.811 ≈ 0.19 bits.

And for 50% → 75% it's H(0.5) − H(0.75) = 1 − 0.811 ≈ 0.19 bits.

So your intuition is correct: these give the same answer.
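For anyone who wants to check the arithmetic, a short standard-library version:

```python
from math import log2

def entropy(p: float) -> float:
    """Entropy in bits of beliefs about a binary proposition with P(A) = p."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(entropy(0.5) - entropy(0.25))  # ~0.189 bits gained going from 50% to 25%
print(entropy(0.5) - entropy(0.75))  # ~0.189 bits gained going from 50% to 75%
```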

Comment by mattmacdermott on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-27T22:58:09.853Z · LW · GW

Sorry, on reflection I had that wrong.

When distributing probability over outcomes, both arithmetic and geometric maximisation want to put as much probability as possible on the highest payoff outcome. It's when distributing payoffs over outcomes (e.g. deciding what bets to make) that geometric maximisation wants to distribution-match them to your probabilities.
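To spell that out (a standard calculation, not from the post itself): with payoffs $u_i$ fixed and a distribution $q$ to choose,

$$\arg\max_q \sum_i q_i u_i \;=\; \arg\max_q \prod_i u_i^{\,q_i} \;=\; \text{all mass on } \arg\max_i u_i,$$

whereas with probabilities $p_i$ fixed and a stake $\sum_i w_i = W$ to distribute over outcomes, the arithmetic expectation $\sum_i p_i w_i$ is maximised by putting everything on the most likely outcome, while the geometric expectation $\prod_i w_i^{\,p_i}$ is maximised by $w_i = p_i W$, i.e. distribution-matching.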

Comment by mattmacdermott on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-26T00:12:11.474Z · LW · GW

If we consider the relation between utility functions and probability distributions, it gets even more literal. An utility function over X could be viewed as a target probability distribution over X, and maximizing expected utility is equivalent to minimizing cross-entropy between this target distribution and the real distribution.

This view can be a bit misleading, since it makes it sound like EU-maxing is like minimising H(u,p): making the real distribution similar to the target distribution.

But actually it’s like minimising H(p,u): putting as much probability as possible on the mode of the target distribution.
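Writing both out makes the asymmetry clear (with $p$ the real distribution and $u$ the target):

$$H(p,u) = -\sum_x p(x)\log u(x), \qquad H(u,p) = -\sum_x u(x)\log p(x).$$

The first is minimised over $p$ by concentrating all probability on the mode of $u$; the second is minimised over $p$ by setting $p = u$ (Gibbs' inequality).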

(Although interestingly geometric EU-maximising is actually equivalent to minimising H(u,p)/making the real distribution similar to the target.)

EDIT: Last line is wrong, see below.

Comment by mattmacdermott on Think carefully before calling RL policies "agents" · 2023-12-13T08:10:40.867Z · LW · GW

I think this is often worth it in personal and blog-post communication, but I wouldn't say "reinforcement function" in e.g. a conference paper.

I came across this in Ng and Russell (2000) yesterday, and searching for it I see it's reasonably common. You could probably get away with it.

Comment by mattmacdermott on What's next for the field of Agent Foundations? · 2023-12-02T11:21:02.269Z · LW · GW

Oops, thanks, I’ve changed it to Reverse MATS to avoid confusion.

Comment by mattmacdermott on Vote on Interesting Disagreements · 2023-11-08T14:07:12.166Z · LW · GW

Empirical agent foundations is currently a good idea for a research direction.

Comment by mattmacdermott on Vote on Interesting Disagreements · 2023-11-08T14:03:23.636Z · LW · GW

'Descriptive' agent foundations research is currently more important to work on than 'normative' agent foundations research.

Comment by mattmacdermott on Vote on Interesting Disagreements · 2023-11-08T14:01:39.303Z · LW · GW

The work of agency-adjacent research communities such as artificial life, complexity science and active inference is at least as relevant to AI alignment as LessWrong-style agent foundations research is.

Comment by mattmacdermott on Vote on Interesting Disagreements · 2023-11-08T13:57:52.602Z · LW · GW

Agent foundations research should become more academic on the margin (for example by increasing the paper to blogpost ratio, and by putting more effort into relating new work to existing literature).

Comment by mattmacdermott on Optimisation Measures: Desiderata, Impossibility, Proposals · 2023-08-15T13:18:35.189Z · LW · GW

Changed 1, thanks.

You definitely wouldn't want to drop invariance, I think. Probably zero for unchanged expected utility and strict monotonicity could go, but I think you would need a conceptual argument about what you want OP to measure in order to constrain the search space a bit.

Comment by mattmacdermott on Optimisation Measures: Desiderata, Impossibility, Proposals · 2023-08-15T13:11:23.143Z · LW · GW

Is the general point that optimisation power should be about how difficult a state of affairs is to achieve, not how desirable it is?

I think that's very reasonable. The intuition going the other way is that maybe we only want to credit useful optimisation. If you neither enjoy robbing banks nor make much money from it, maybe I'm not that impressed about the fact you can do it, even if it's objectively difficult to pull off.

Another point is that we can sort of use the desirability of the state of affairs someone manages to achieve as a proxy for how wide a range of options they had at their disposal. This doesn't apply to the difficulty of achieving the state of affairs, since we don't expect people to be optimising for difficulty. This is an afterthought, though, and maybe there would be better ways to try to measure someone's range of options.

Comment by mattmacdermott on Optimisation Measures: Desiderata, Impossibility, Proposals · 2023-08-07T16:02:21.148Z · LW · GW

Thanks, should be fixed now.

It's not that we needed to add a translation here to end up with the right definition of one quantity in terms of the other, but with the way we had written it, it wasn't a well-defined function of equivalence classes. We had restated proposition 1 to try to make things cleaner, but it turns out it messed things up, so we've reverted to the previous statement. Hopefully it should all work now.

Comment by mattmacdermott on Towards Measures of Optimisation · 2023-07-17T15:54:57.657Z · LW · GW

The above formulas rely on comparing the actual world to a fixed counterfactual baseline. Gaining more information about the actual world might make the distance between the counterfactual baseline and the actual world grow smaller, but it also might make it grow bigger, so it's not the case that the optimisation power goes to zero as my uncertainty about the world decreases. You can play with the formulas and see.

But maybe your objection is not so much that the formulas actually spit out zero, but that if I become very confident about what the world is like, it stops being coherent to imagine it being different? This would be a general argument against using counterfactuals to define anything. I'm not convinced of it, but if you like you can purge all talk of imagining the world being different, and just say that measuring optimisation power requires a controlled experiment: set up the messy room, record what happens when you put the robot in it, set the room up the same, and record what happens with no robot.

Comment by mattmacdermott on What to read on the "informal multi-world model"? · 2023-07-11T17:09:25.053Z · LW · GW

Despite the similar terminology, people on this site usually aren't talking about the many worlds interpretation of quantum mechanics when they say things like "in 50% of worlds the coin comes up heads".

The overwhelmingly dominant use of probabilities on this website is the subjective Bayesian one i.e. using probabilities to report degrees of belief. You can think of your beliefs about how the coin will turn out as a distribution over possible worlds, and the result of the coin flip as giving you information about which world you inhabit. This turns out to be a nice intuitive way to think about things, especially when it comes to doing an informal version of Bayesian updating in your head.

This has nothing really to do with quantum mechanics. The worlds don't need to have any correspondence to the worlds of the many-worlds interpretation, and I would still think and talk like this regardless of what I believed about QM.

It probably comes from modal logic, where it's standard terminology to talk about the worlds in which some proposition is true. From a quick google this goes back to at least CI Lewis (1943), which predates the many-worlds interpretation of quantum mechanics, and probably further. Here's the wikipedia page on possible worlds. Probably there's a good resource which explains the terminology in a subjective probability context, but I can't find one right now.

Comment by mattmacdermott on Consciousness as a conflationary alliance term for intrinsically valued internal experiences · 2023-07-10T16:52:34.482Z · LW · GW

Could it be that everyone’s talking about the same thing, but it’s just hard to pin the concept down in words?

I think you’d get similarly varied answers if you asked people what they mean by ‘art’, but I think they’re all basically talking about the same phenomenon.

Comment by mattmacdermott on Utility Maximization = Description Length Minimization · 2023-06-28T22:58:49.091Z · LW · GW

It's worth emphasising just how closely related it is. Friston's expected free energy of a policy decomposes into two terms, where the first is the expected information gained by following the policy and the second is the expected 'extrinsic value'.

The extrinsic value term, translated into John's notation and setup, is precisely John's expected-utility term. Where John has optimisers acting to minimise the cross-entropy of the target distribution with respect to the real distribution over world states, Friston has agents choosing a policy to minimise the cross-entropy of preferences with respect to beliefs.

What's more, Friston explicitly thinks of the extrinsic value term as a way of writing expected utility (see the image below from one of his talks). In particular, the preference distribution is a way of representing real-valued preferences as a probability distribution. He often constructs it by writing down a utility function and then taking a softmax (like in this rat T-maze example), which is exactly what John's construction amounts to.
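A tiny numerical check of that equivalence (my own toy example, not from the talk):

```python
import numpy as np

u = np.array([0.0, 1.0, 2.0, 5.0])        # a real-valued utility over four outcomes
target = np.exp(u) / np.exp(u).sum()      # Friston-style preference distribution: softmax of u

p = np.array([0.1, 0.2, 0.3, 0.4])        # some distribution over outcomes induced by a policy
extrinsic_value = (p * np.log(target)).sum()   # E_p[log target], i.e. minus the cross-entropy
expected_utility = (p * u).sum()
log_Z = np.log(np.exp(u).sum())

# The two differ only by the constant log-normaliser, so maximising one maximises the other.
assert np.isclose(extrinsic_value, expected_utility - log_Z)
```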

It seems that John is completely right when he speculates that he's rediscovered an idea well-known to Karl Friston.

 An image where Friston observes that ignoring the information gain (or 'intrinsic value') term in expected free energy gets you expected utility.

Comment by mattmacdermott on Towards Measures of Optimisation · 2023-05-21T15:56:31.878Z · LW · GW

We’re already comparing to the default outcome in that we’re asking “what fraction of the default expected utility minus the worst comes from outcomes at least this good?”.

I think you’re proposing to replace “the worst” with “the default”, in which case we end up dividing by zero.

We could pick some other new reference point other than the worst, but different to the default expected utility. (But that does introduce the possibility of negative OP and still has sensitivity issues.)

Comment by mattmacdermott on Towards Measures of Optimisation · 2023-05-12T21:43:44.150Z · LW · GW

Nice, I'd read the first but didn't realise there were more. I'll digest later.

I think agents vs optimisation is definitely reality-carving, but I'm not sure I see the point about utility functions and preference orderings. I assume the idea is that an optimisation process just moves the world towards certain states, whereas an agent tries to move the world towards certain states, i.e. chooses actions based on how much they move the world towards those states, so it makes sense to quantify how much of a weighting each state gets in its decision-making. But it's not obvious to me that there's not a meaningful way to assign weightings to states for an optimisation process too - for example, if a ball rolling down a hill gets stuck in the large hole twice as often as it gets stuck in the medium hole and ten times as often as the small hole, maybe it makes sense to quantify this with something like a utility function. Although defining a utility function based on the typical behaviour of the system and then trying to measure its optimisation power against it gets a bit circular.

Anyway, the dynamical systems approach seems good. Have you stopped working on it?

Comment by mattmacdermott on Towards Measures of Optimisation · 2023-05-12T21:02:23.725Z · LW · GW

Probably the easy utility function makes agent 1 have more optimisation power. I agree this means comparisons between different utility functions can be unfair, but not sure why that rules out a measure which is invariant under positive affine transformations of a particular utility function?

Comment by mattmacdermott on Towards Measures of Optimisation · 2023-05-12T20:53:50.393Z · LW · GW

Hm, I'm not sure this problem comes up.

Say I've built a room-tidying robot, and I want to measure its optimisation power. The room can be in two states: tidy or untidy. A natural choice of default distribution is my beliefs about how tidy the room will be if I don't put the robot in it. Let's assume I'm pretty knowledgeable and I'm extremely confident that in that case the room will be untidy: P(untidy) = 0.9995 and P(tidy) = 0.0005 (we do have to avoid probabilities of 0, but that's standard in a Bayesian context). But really I do put the robot in and it gets the room tidy, for an optimisation power of −log₂(0.0005) ≈ 11 bits.

That 11 bits doesn't come from any uncertainty on my part about the optimisation process, although it does depend on my uncertainty about what would happen in the counterfactual world where I don't put the robot in the room. But becoming more confident that the room would be untidy in that world makes me see the robot as more of an optimiser.

Unlike in information theory, these bits aren't measuring a resolution of uncertainty, but a difference between the world and a counterfactual.

Comment by mattmacdermott on The ground of optimization · 2023-05-09T09:59:56.713Z · LW · GW

An interesting point about the agency-as-retargetable-optimisation idea is that it seems like you can make the perturbation in various places upstream of the agent's decision-making, but not downstream, i.e. you can retarget an agent by perturbing its sensors more easily than its actuators.

For example, to change a thermostat-controlled heating system to optimise for a higher temperature, the most natural perturbation might be to turn the temperature dial up, but you could also tamper with its thermistor so that it reports lower temperatures. On the other hand, making its heating element more powerful wouldn't affect the final temperature.

I wonder if this suggests that an agent's goal lives in the last place in a causal chain of things you can perturb to change the set of target states of the system.

Comment by mattmacdermott on Normative vs Descriptive Models of Agency · 2023-02-09T06:35:58.862Z · LW · GW

Do you expect useful generic descriptive models of agency to exist?

Comment by mattmacdermott on Normative vs Descriptive Models of Agency · 2023-02-09T06:23:21.316Z · LW · GW

Nice, thanks. It seems like the distinction the authors make between 'building agents from the ground up' and 'understanding their behaviour and predicting roughly what they will do' maps to the distinction I'm making, but I'm not convinced by the claim that the second one is a much stronger version of the first.

The argument in the paper is that the first requires an understanding of just one agent, while the second requires an understanding of all agents. But it seems like they require different kinds of understanding, especially if the agent being built is meant to be some theoretical ideal of rationality. Building a perfect chess algorithm is just a different task to summarising the way an arbitrary algorithm plays chess (which you could attempt without even knowing the rules).