## Posts

Demanding and Designing Aligned Cognitive Architectures 2021-12-21T17:32:57.482Z
Safely controlling the AGI agent reward function 2021-02-17T14:47:00.293Z
Graphical World Models, Counterfactuals, and Machine Learning Agents 2021-02-17T11:07:47.249Z
Disentangling Corrigibility: 2015-2021 2021-02-16T18:01:27.952Z
Creating AGI Safety Interlocks 2021-02-05T12:01:46.221Z
Counterfactual Planning in AGI Systems 2021-02-03T13:54:09.325Z
New paper: AGI Agent Safety by Iteratively Improving the Utility Function 2020-07-15T14:05:11.177Z
The Simulation Epiphany Problem 2019-10-31T22:12:51.323Z
New paper: Corrigibility with Utility Preservation 2019-08-06T19:04:26.386Z

Comment by Koen.Holtman on Scalar reward is not enough for aligned AGI · 2022-01-20T16:47:08.285Z · LW · GW

Reading the paper Reward is Enough, what strikes me most is that the paper is reductionist almost to the point of being a self-parody.

Take a sentence like:

The reward-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment.

I could rewrite this to

The physics-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as being the laws of physics acting in an environment.

If I do that rewriting throughout the paper, I do not have to change any of the supporting arguments put forward by the authors: they equally support the physics-is-enough reductionist hypothesis.

The authors of 'reward is enough' posit that rewards explain everything, so you might think that they would be very interested in spending more time to look closely at the internal structure of actual reward signals that exist in the wild, or actual reward signals that might be designed. However, they are deeply uninterested in this. In fact they explicitly invite others to join them in solving the 'challenge of sample-efficient reinforcement learning' without ever doing such things.

Like you I feel that, when it comes to AI safety, this lack of interest in the details of reward signals is not very helpful. I like the multi-objective approach (see my comments here), but my own recent work like this has been more about abandoning the scalar reward hypothesis/paradigm even further, about building useful models of aligned intelligence which do not depend purely on the idea of reward maximisation. In that recent paper (mostly in section 7) I also develop some thoughts about why most ML researchers seem so interested in the problem of designing reward signals.

Comment by Koen.Holtman on Challenges with Breaking into MIRI-Style Research · 2022-01-20T12:19:33.955Z · LW · GW

Any thoughts on how to encourage a healthier dynamic.

I have no easy solution to offer, except for the obvious comment that the world is bigger than this forum.

My own stance is to treat the over-production of posts of type 1 above as just one of these inevitable things that will happen in the modern media landscape. There is some value to these posts, but after you have read about 20 of them, you can be pretty sure about how the next one will go.

So I try to focus my energy, as a reader and writer, on work of type 2 instead. I treat arXiv as my main publication venue, but I do spend some energy cross-posting my work of type 2 here. I hope that it will inspire others, or at least counter-balance some of the type 1 work.

Comment by Koen.Holtman on Challenges with Breaking into MIRI-Style Research · 2022-01-19T17:13:11.685Z · LW · GW

I like your summary of the situation:

Most people doing MIRI-style research think most other people doing MIRI-style research are going about it all wrong.

This has also been my experience, at least on this forum. Much less so in academic-style papers about alignment. This has certain consequences for the problem of breaking into preparadigmatic alignment research.

Here are two ways to do preparadigmatic research:

1. Find something that is all wrong with somebody else's paradigm, then write about it.

MIRI-style preparadigmatic research, to the extent that it is published, read, and discussed on this forum, is almost all about the first of the above. Even on a forum as generally polite and thoughtful as this one, social media dynamics promote and reward the first activity much more than the second.

In science and engineering, people will usually try very hard to make progress by standing on the shoulders of others. The discourse on this forum, on the other hand, more often resembles that of a bunch of crabs in a bucket.

My conclusion is of course that if you want to break into preparadigmatic research, then you are going about it all wrong if your approach is to try to engage more with MIRI, or to maximise engagement scores on this forum.

Comment by Koen.Holtman on An Open Philanthropy grant proposal: Causal representation learning of human preferences · 2022-01-11T13:43:14.734Z · LW · GW

In other words, human preferences have a causal structure, can we learn its concepts and their causal relations?

[...]

Since I am not aware of anyone trying to use these techniques in AI Safety

I am not fully sure what particular sub-problem you propose to address (causal learning of the human reward function? Something different?), but some references to recent work you may find interesting:

Two recent workshops at NeurIPS 2021:

I have not read any of the papers in the above workshops yet, but I mention them as likely sources of places where people would discuss the latest status of the type of techniques you may be considering. Work I have written/read:

Much about the above work is about hand-constructing a type of machine reasoning which is more close to human values, but the artefacts being hand-constructed might conceivably also be constructed by an ML process that leverages a certain training set in a certain way.

Comment by Koen.Holtman on My Overview of the AI Alignment Landscape: A Bird's Eye View · 2022-01-10T11:57:57.852Z · LW · GW

Thanks, yes that new phrasing is better.

Bit surprised that you can think of no researchers to associate with Corrigibility. MIRI have written concrete work about it and so has Christiano. It is a major theme in Bostrom's Superintelligence, and it also appears under the phrasing 'problem of control' in Russell's Human Compatible.

In terms of the history of ideas of the field, I think it that corrigibility is a key motivating concept for newcomers to be aware of. See this writeup on corrigibility, which I wrote in part for newcomers, for links to broader work on corrigibility.

I've only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems.

My current reading of the field is that Christiano believes that corrigibility will appear as an emergent property as a result of building an aligned AGI according to his agenda, while MIRI on the other hand (or at least 2021 Yudkowsky) have abandoned the MIRI 2015 plans/agenda to produce corrigibility, and now despair about anybody else ever producing corrigibility either. The CIRL method discussed by Russell produces a type of corrigibility, but as Russell and Hadfield-Menell point out, this type decays as the agent learns more, so it is not a full solution.

I have written a few papers which have the most fully fledged plans that I am aware of, when it comes to producing (a pretty useful and stable version of) AGI corrigibility. This sequence is probably the most accessible introduction to these papers.

Comment by Koen.Holtman on My Overview of the AI Alignment Landscape: A Bird's Eye View · 2022-01-08T15:46:41.924Z · LW · GW

Thanks for posting this writeup, overall this reads very well, and it should be useful to newcomers. The threat models section is both compact and fairly comprehensive.

I have a comment on the agendas to build safe AGI section however. In the section you write

I focus on three agendas I consider most prominent

When I finished reading the list of three agendas in it, my first thought was 'Why does this not mention other prominent agendas like corrigibility? This list is hardly is a birds-eye overview mentioning all prominent agendas to build safe AI.'

The three proposals I discuss here are just the three I know the most about, have seen the most work on and, in my subjective judgement, the ones it is most worth newcomers to the field learning about.

which is quite different. My feeling is that your Google document description of what you are doing here in scoping is much more accurate and helpful to the reader than the 'most prominent' you use above.

Comment by Koen.Holtman on $1000 USD prize - Circular Dependency of Counterfactuals · 2022-01-07T13:43:44.348Z · LW · GW Not aware of which part would be a Wittgenstenian quote. Long time ago that I read Wittgenstein, and I read him in German. In any case, I remain confused on what you mean with 'circular'. Comment by Koen.Holtman on$1000 USD prize - Circular Dependency of Counterfactuals · 2022-01-07T13:10:11.228Z · LW · GW

Wait, I was under the impression from the quoted text that you make a distinction between 'circular epistemology' and 'other types of epistemology that will hit a point where we can provide no justification at all'. i.e. these other types are not circular because they are ultimately defined as a set of axioms, rewriting rules, and observational protocols for which no further justification is being attempted.

So I think I am still struggling to see what flavour of philosophical thought you want people to engage with, when you mention 'circular'.

Mind you, I see 'hitting a point where we provide no justification at all' as a positive thing in a mathematical system, a physical theory, or an entire epistemology, as long as these points are clearly identified.

Comment by Koen.Holtman on $1000 USD prize - Circular Dependency of Counterfactuals · 2022-01-07T12:47:41.432Z · LW · GW OK thanks for explaining. See my other recent reply for more thoughts about this. Comment by Koen.Holtman on$1000 USD prize - Circular Dependency of Counterfactuals · 2022-01-06T14:53:26.599Z · LW · GW

It's possible for an article to be here's why these 3 reasons why we might think counterfactuals are circular are all false

OK, so if I understand you correctly, you posit that there is something called 'circular epistemology'. You said in the earlier post you link to at the top:

You might think that the circularity is a problem, but circular epistemology turns out to be viable (see Eliezer's Where Recursive Justification Hits Bottom). And while circular reasoning is less than ideal, if the comparative is eventually hitting a point where we can provide no justification at all, then circular justification might not seem so bad after all.

You further suspect that circular epistemology might have something useful to say about counterfactuals, in terms of offering a justification for them without 'hitting a point where we can provide no justification at all'. And you have a bounty for people writing more about this.

Am I understanding you correctly?

Comment by Koen.Holtman on 1000 USD prize - Circular Dependency of Counterfactuals · 2022-01-06T14:36:58.375Z · LW · GW Secondly, I guess my issue with most of the attempts to say "use system X for counterfactuals" is that people seem to think ??? I don't follow. You meant to write "use system X instead of using system Y which calls itself a definition of counterfactuals "? Comment by Koen.Holtman on More Is Different for AI · 2022-01-06T13:41:32.029Z · LW · GW Interesting. Reading the different paragraphs I am somewhat confused on how you classify thought experiments: part of engineering, part of philosophy, or third thing by itself? I'd be curious to see you expand on following question: if we treat thought experiments as not being a philosophical technique, what other techniques or insights does philosophy have to offer to alignment? Another comment: you write When thinking about safety risks from ML, there are two common approaches, which I'll call the Engineering approach and the Philosophy approach. My recent critique here (and I expand on this in the full paper) is that the x-risk community is not in fact using a broad engineering approach to ML safety at all. What is commonly used instead is a much more narrow ML research approach, the approach which sees every ML safety problem as a potential ML research problem. On the engineering side, things need to get much more multi-disciplinary. Comment by Koen.Holtman on1000 USD prize - Circular Dependency of Counterfactuals · 2022-01-06T12:37:15.647Z · LW · GW

Some people have asked why the Bayesian Network approach suggested by Judea Pearl is insufficient (including in the comments below). This approach is firmly rooted in Causal Decision Theory (CDT). Most people on LW have rejected CDT because of its failure to handle Newcomb's Problem.

I'll make a counter-claim and say that most people on LW in fact have rejected the use of Newcomb's Problem as a test that will say something useful about decision theories.

That being said, there is definitely a sub-community which believes deeply in the relevance of Newcomb's Problem as a test. This sub-community has historically created, and is still creating, a lot of traffic on this forum. This is to be expected: the people who reject Newcomb's Problem do not tend to post about it that much.

Personally, I reject Newcomb's Problem as a test.

I am also among the crowd who have posted explanations of Pearl Causality and Counterfactuals. My explanation here highlights the 'using a different world model' interpretation of Pearl's counterfactual math, so it may in fact touch on your reframing:

Or reframing this, counterfactuals only make sense from a cognitive frame.

I guess I'd roughly describe [a cognitive frame] as something that forms models of the world.

Overall, reading the post and the comment section, I feel that, if I reject Newcomb's Problem as a test, I can only ever write things that will not meet your prize criterion of usefully engaging with 'circular dependency'.

I have a sense that with 'circular dependency' you are also pointing to a broader class of philosophical problems of 'what does it mean for something to be true or correctly inferred'. If these were spelled out in detail, I also believe that I would end up rejecting the notion that we need to solve all these open problems definitively, the notion that these problems represent gaps in an agent foundations framework that still need to be filled, if the framework is to support AGI safety/alignment.

Comment by Koen.Holtman on Demanding and Designing Aligned Cognitive Architectures · 2021-12-23T14:26:29.513Z · LW · GW

Not entirely sure what you mean with your aside on 'unsupervised predictive reward'. Is this a reference to unsupervised reinforcement learning? To a human supervisor controlling a live reward signal?

But on your observation that 'the learned model will compensate for distortions': this sounds familiar. Here is a discussion.

# Intuition pumps and inner alignment failure

It is common for people on this forum to use a teleological intuition pump which makes them fear that such compensation for distortions must somehow always happen, or is very likely to happen, as an emergent phenomenon in any advanced RL or AGI system. Inner alignment failure has become a popular term on this forum when talking about this fear.

But this teleological intuition pump that 'life reward maximization, uh, finds a way' is deeply incorrect in the general case. Especially for the case of agents which are designed not to be pure reward maximizers.

I have more discussion about this in the paper, where I show how you can approach figure 4 with very different intuition pumps about using incorrect maps of territories. These intuition pumps that will tell you much more clearly how and why this works. For the mathematically inclined, I also include references to hard mathematical work, which should of course take precedence over intuition.

# Steel-manning an inner alignment failure argument

That being said, I'll now provide provide some failure mode analysis to show that 'the learned model will compensate for distortions' could conceivably happen, if no care is taken at all to prevent it.

There are two world models in the cognitive architecture of figure 4, a blue one and a green one. The green one drives the agent's decisions. The goal of the architecture is to ensure that this green word model driving the reward-optimizing decisions is specifically incorrect.

To counteract that goal, via the failure mode of 'the learned model will compensate for distortions', we have to imagine the following. We have to imagine that ML algorithm in the figure will end up introducing specific pre-distortions into the blue model, making the blue world model in fact an incorrect world model in a way that exactly anticipates and reverses the model edits that will be applied to it later, producing a green world model that is a fully correct world model.

To make this rather complicated path to failure work,

1. We have to posit that the model editing operation has certain properties that allow it to be attacked by a distortion technique in the first place. Not all model editing algorithms can be attacked in this way. An editing algorithm may work by just overwriting or eliminating exactly those parts of a world model where adding the pre-distortions would make for a successful attack. Or it may use a loop where it keeps adjusting pre-distorted model parameters, for as long as needed until some desirability criterion is met. (This is actually becoming a thing in mainstream DL research on computational fairness, judging by some recent NeurIPS papers.)

2. We also have to posit that there is some pretty strong optimization pressure of a particular kind going on, there has to be a feedback loop that will work to introduce exactly the pre-distortions needed
in the first place. These distortions also have to be of a nature where they do not directly contradict past observations, because the ML algorithm is searching for a model that is compatible with past observations, so it applies strong optimisation pressure of its own kind. So this feedback loop has some rather fancy work to do, if it is to achieve exactly the distortions that are needed above.

The most straightforward way to eliminate or suppress a feedback loop achieving all of the above is to map the cognitive architecture in figure 4 to a model-based RL system, not a model-free one. For the model-free case, ensuring that the above feedback loop cannot emerge, under reasonable i.i.d. assumptions about the agent environment, can be more tricky.

If you want to get even more fancy steel-manning this, you can replace 2 above by an adversarial attack scenario where a dedicated attacker breaks reasonable i.i.d. assumptions. In that case all bets are off.

I feel that the inner optimization people on this forum are correct that, when it comes to x-risk, we need to care about the possible existence of feedback loops as under 2 above. But I also feel that they are too pessimistic about our ability, or the ability of mainstream ML research for that matter, to use mathematical tools and simulation experiments to identify, avoid, or suppress these loops in a tractable way.

Comment by Koen.Holtman on Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship · 2021-12-22T22:42:15.088Z · LW · GW

I'm especially interested in the analogy between AI alignment and democracy.

This is indeed a productive analogy. Sadly, on this forum, this analogy is used in 99% of the cases to generate AI alignment failure mode stories, whereas I am much more interested in using it to generate useful ideas about AI safety mechanisms.

You may be interested in my recent paper 'demanding and designing', just announced here, where I show how to do the useful idea generating thing. I transfer some insights about aligning powerful governments and companies to the problem of aligning powerful AI.

Comment by Koen.Holtman on Consequentialism & corrigibility · 2021-12-22T17:57:50.396Z · LW · GW

Very open to feedback.

I have not read the whole comment section, so this feedback may already have been given, but...

I believe the “indifference” method represented some progress towards a corrigible utility-function-over-future-states, but not a complete solution (apparently it’s not reflectively consistent—i.e., if the off-switch breaks, it wouldn't fix it), and the problem remains open to this day.

Opinions differ on how open the problem remains. Definitely, going by the recent Yudkowsky sequences, MIRI still acts as if the problem is open, and seems to have given up on making progress on it, or believing that anybody else has made progress or can make progress. I on the other hand believe that the problem of figuring out how to make indifference methods work is largely closed. I have written papers on it, for example here. But you have told me before you have trouble reading my work, so I am not sure I can help you any further.

My impression is that, in these links, Yudkowsky is suggesting that powerful AGIs will purely have preferences over future states.

My impression is that Yudkowsky only cares about designing the type of powerful AGIs that will purely have preferences over future states. My impression is that he considers AGIs which do not purely have preferences over future states to be useless to any plan that might save the world from x-risk. In fact, he feels that these latter AGIs are not even worthy of the name AGI. At the same time, he worries that these consequentialist AGIs he wants will kill everybody, if some idiot gives them the wrong utility function.

This worry is of course entirely valid, so my own ideas about safe AGI designs tend to go heavily towards favouring designs that are not purely consequentialist AGIs. My feeling is that Yudkowsky does not want to go there, design-wise. He has locked himself into a box, and refuses to think outside of it, to the extent that he even believes that there is no outside.

As you mention above. if you want to construct a value function component that measures 'humans stay in control', this is very possible. But you will have to take into account that a whole school of thought on this forum will be all too willing to criticise your construction for not being 100.0000% reliable, for having real or imagined failure modes, for not being the philosophical breakthrough they really want to be reading about. This can give you a serious writer's block, if you are not careful.

Comment by Koen.Holtman on Demanding and Designing Aligned Cognitive Architectures · 2021-12-22T15:27:33.553Z · LW · GW

Thanks!

I can think of several reasons why different people on this forum might facepalm when seeing the diagram with the green boxes. Not sure if I can correctly guess yours. Feel free to expand.

But there are definitely lots of people saying that AI alignment is part of the field of AI, and it sounds like you're disagreeing with that as well - is that right?

Yes I am disagreeing, of sorts. I would disagree with the statement that

| AI alignment research is a subset of AI research

but I agree with the statement that

| Some parts of AI alignment research are a subset of AI research.

As argued in detail in the paper, I feel that fields outside of AI research have a lot to contribute to AI alignment, especially when it comes to correctly shaping and managing the effects of actually deployed AI systems on society. Applied AI alignment is a true multi-disciplinary problem.

# On bravery debates in alignment

But there are definitely lots of people saying that AI alignment is part of the field of AI [...] How much would you say that this categorization is a bravery debate

In the paper I mostly talk about what each field has to contribute in expertise, but I believe there is definitely also a bravery debate angle here, in the game of 4-dimensional policy discussion chess.

I am going to use the bravery debate definition from here:

That’s what I mean by bravery debates. Discussions over who is bravely holding a nonconformist position in the face of persecution, and who is a coward defending the popular status quo and trying to silence dissenters.

I guess that a policy discussion devolving into a bravery debate is one of these 'many obstacles' which I mention above, one of the many general obstacles that stakeholders in policy discussions about AI alignment, global warming, etc will need to overcome.

From what I can see, as a European reading the Internet, the whole bravery debate anti-pattern seems to be very big right now in the US, and it has also popped up in discussions about ethical uses of AI. AI technologists have been cast as both cowardly defenders of the status quo, and as potentially brave nonconformists who just need to be woken up, or have already woken up.

There is a very interesting paper which has a lot to say about this part of the narrative flow: Better, Nicer, Clearer, Fairer: A Critical Assessment of the Movement for Ethical Artificial Intelligence and Machine Learning. One point the authors make is that it benefits the established business community to thrust the mantle of the brave non-conformist saviour onto the shoulders of the AI technologists.

Comment by Koen.Holtman on The Plan · 2021-12-15T22:39:59.913Z · LW · GW

I agree in general that pursuing multiple alternative alignment approaches (and using them all together to create higher levels of safety) is valuable. I am more optimistic than you that we can design control systems (different from time horizon based myopia) which will be stable and understandable even at higher levels of AGI competence.

it still seems likely that someone, somewhere, will try fiddling around with another AGI's time horizon parameters and cause a disaster.

Well, if you worry about people fiddling with control system tuning parameters, you also need to worry about someone fiddling with value learning parameters so that the AGI will only learn the values of a single group of people who would like to rule the rest of the world. Assming that AGI is possible, I believe it is most likely that Bostrom's orthogonality hypothesis will hold for it. I am not optimistic about desiging an AGI system which is inherently fiddle-proof.

Comment by Koen.Holtman on The Plan · 2021-12-11T21:12:46.161Z · LW · GW

I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia).

Interesting observation on the above post! Though I do not read it explicitly in John's Plan, I guess you can indeed implicitly read that John's Plan rejects routes to alignment that focus on control/myopia, routes that do not visit step 2.of successfully solving automatic/ambitious value learning first.

John, can you confirm this?

Background: my own default Plan does focus on control/myopia. I feel that this line of attack for solving AGI alignment (if we ever get weak or strong AGI) is reaching the stage where all the major points of 'fundamental confusion' have been solved. So for me this approach represents the true 'easier strategy'.

Comment by Koen.Holtman on On Solving Problems Before They Appear: The Weird Epistemologies of Alignment · 2021-12-11T17:02:54.224Z · LW · GW

OK, here is the promised list formal methods based work which has advanced the field of AGI safety. So these are specific examples to to back up my earlier meta-level remarks where I said that formal methods are and have been useful for AGI safety.

To go back to the Wikipedia quote:

The use of formal methods for software and hardware design is motivated by the expectation that, as in other engineering disciplines, performing appropriate mathematical analysis can contribute to the reliability and robustness of a design.

There are plenty of people in CS who are inspired by this thought: if we can design safer bridges by using the tool of mathematical analysis, why not software?

In modern ML, there are a lot of people who use mathematical analysis in an attempt to better understand or characterize the performance of ML algorithms. Such people have in fact been there since the start of AI as a field.

I am attending NeurIPS 2021, and there are lots of papers which are all about this use of formal methods, or have sections which are all about this. To pick a random example, there is 'Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss' (preprint here) where the abstract says things like

Minimizing this objective leads to features with provable accuracy guarantees under linear probe evaluation. By standard generalization bounds, these accuracy guarantees also hold when minimizing the training contrastive loss.

(I have not actually read the paper. I am just using this as a random example.)

In general, my sense is that in the ML community right now, there is a broad feeling that mathematical analysis still needs to have some breakthrough before it can say truly useful things, truly useful things when it comes to safety engineering, for AIs that use deep neural networks.

If you read current ML research papers on 'provable guarantees' and then get very unexited, or even deeply upset at how many simplifying assumptions the authors are making in order to allow for any proofs at all, I will not blame you. My sense is also that different groups of ML researchers disagree about how likely such future mathematical breakthroughs will be.

As there is plenty of academic and commercial interest in this formal methods sub-field of ML, I am not currently working in it myself, and I do not follow it closely. As an independent researcher, I have the freedom to look elsewhere.

I am looking in a different formal methods corner, one where I abstract away from the details of any specific ML algorithm. I then try to create well-defined safety mechanisms which produce provable safety properties, properties which apply no matter what ML algorithm is used. This ties back into what I said earlier:

the design flow here is that alignment researchers spend these 2 years of time using formal methods to develop and publish AGI safety mechanisms right now, before anybody invents viable AGI.

# The list

So here is an annotated list of some formal methods based work which has, in my view, advanced the field of AGI safety. (To save me some time, I will not add any hyperlinks, pasting the title into google usually works.)

• Hutter's AIXI model, 'Universal algorithmic intelligence: A mathematical top down approach' models superintelligence in a usefully universal way.

• I need to point out however that the age-old optimal-policy concept in MDPs also models superintelligence in a usefully universal way, but at a higher level of abstraction.

• MIRI's 2015 paper 'Corrigibility' makes useful contributions to defining provable safety properties, though it then also reports failure in the effort of attempting to design an agent that provably satisfies them. At a history-of-ideas level, what it interesting here is that 2015 MIRI seemed to have been very much into using formal methods to make progress on alignment, but 2021 MIRI seems to feel that there is much less promise to the approach. Or at least, going by recently posted discussions, 2021 Yudkowsky seems to have lost all hope that these methods will end up saving the world. My own opinion is much more like 2015 MIRI than it is like 2021 MIRI. Modulo of course the idea that people meditating on HPMOR is a uniquely promising approach to solving alignment. I have never much felt he appeal of that approach myself. I think you have argued recently that 2021 Yudkowsky has also become less enthusiastic about the idea.

• The above 2015 MIRI paper owes a lot to Armstrong's 2015 'Motivated value selection for artificial agents', the first official academic-style paper about Armstrong's indifference methods. This paper arguably gets a bit closer to describing an agent design that provably meets the safety properties in the MIRI paper. The paper gets closer if we squint just right and interpret Armstrong's 2005 somewhat ambiguous conditional probability notation for defining counterfactuals in just the right way.

• My 2019 paper 'Corrigibility with Utility Preservation' in fact can be read to show how you need to squint in just the right way if you want to use Armstrong's indifference methods to create provable corrigibility related safety properties. The paper also clarifies a lot of questions about agent self-knowledge and emerging self-preservation incentives. Especially because of this second theme, the level of math in this paper tends to be more than what most people can or want to deal with.

• In my 2020 'AGI Agent Safety by Iteratively Improving the Utility Function' I use easier and more conventional MDP math to define a corrigible (for a definition of corrigibility) agent using Armstrong's indifference methods as a basis. This paper also shows some results and design ideas going beyond stop buttons and indifference methods.

• In my 2021 'Counterfactual Planning in AGI Systems' I again expand on some of these themes, developing a different mathematical lens which I feel clarifies several much broader issues, and makes the math more accessible.

• Orseau and Armstrong's 2016 'Safely interruptible agents' is often the go-to cite when people want to cite Armstrong's indifference methods. In fact it formulates and proves (for some ML systems) a related 'safely interruptable' safety property, a property which we also want to have if we apply a design like indifference methods.

• More recent work which expands on safe interruptability is the 2021 'How RL Agents Behave When Their Actions Are Modified' by Langlois and Everitt.

• Everitt and Hutter's 2016 'Self-modification of policy and utility function in rational agents' also deals with agent self-knowledge and emerging self-preservation incentives, but like my above 2019 paper which also deals with the topic, the math tends to be a bit too much for many people.

• Carey, Everitt, et al. 2021 'Agent Incentives: A Causal Perspective' has (the latest version of) a definition of causal influence diagrams and some safety properties we might define by using them. The careful mathematical definitions provided in the paper use Pearl's style of defining causal graphs, which is again too much for some people. I have tried to offer a more accessible introduction and definitional framework here.

• Everitt et al's 2019 'Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective' uses the tools defined above to take a broader look at various safety mechanisms and their implied safety properties.

• The 2020 'Asymptotically unambitious artificial general intelligence' by Cohen et al designs a generic system where a uncertain-about-goals agent can call a mentor, and then proves some bounds for it.

• I also need to mention the 2017 'The Off-Switch Game' by Hadfield-Menell et al. Like the 2015 MIRI paper on corrigibility, this paper ends up noting that it has clarified some problems, but without solving all of them.

• There is lots of other work, but this should give you an idea of what I mean with formal methods work that defines useful AGI safety properties in a mathematically precise way.

Overall, given the results in the above work, I feel that formal methods are well underway in clarifying how one might define useful AGI safety properties, and how we might built them into AGI agents, if AGI agents are ever developed.

But it is also clear from recent comments like this one that Yudkowsky feels that no progress has been made. Yudkowsky is not alone in having this opinion.

I feel quite the opposite, but there is only a limited amount of time that I am willing to invest in convincing Yudkowsky, or any other participants on this forum, that a lot of problems in alignment and embeddedness are much more tractable than they seem to believe.

In fact, in the last 6 months or so, I have sometimes felt that the intellectually interesting open purely mathematical problems have been solved to such an extent that I need to expand the scope of my research hobby. Sure, formal methods research on characterizing and improving specific ML algorithms could still use some breakthroughs, but this is not the kind of math that I am interested in working on as a self-funded independent researcher. So I have recently become more interested in advancing the policy parts of the alignment problem. As a preview, see my paper here, for which I still need to write an alignment forum post.

Comment by Koen.Holtman on On Solving Problems Before They Appear: The Weird Epistemologies of Alignment · 2021-12-08T12:13:54.601Z · LW · GW

"we're going to design through proof the most advanced and complex program that ever existed, orders of magnitudes more complex than the most complex current systems".

I disagree that AI code is orders of magnitude more complex than say the code in a web browser or modern compiler: in fact quite the opposite applies. Most modern ML algorithms are very short pieces of code. If you are willing to use somewhat abstract math where you do not write out all the hyperparameter values, you can specify everything that goes on in a deep learning algorithm in only a few lines of pseudocode. Same goes for most modern RL algorithms.

I also note that while modern airplanes contain millions of lines of code, most of it is in the on-board entertainment systems. But the safety-critical subsystems in airplanes them tend to be significantly smaller in code size, and they also run air-gapped from the on-board entertainment system code. This air-gapping of course plays an important role in making the formal proofs for the safety-critical subsystems possible.

But beyond these observations: the main point I am trying to get across is that I do not value formal methods as a post-hoc verification tool that should be applied to millions of lines of code, some of it spaghetti code put together via trial and error. Clearly that approach would not work very well.

I value formal methods as a tool for the aligned AI code specification and design phases.

On the design phase: formal methods offer me a way to create many alternative and clarifying viewpoints on what the code is doing or intending to do, viewpoints not offered by the text of the code itself. For example, formally expressed loop invariants can express much more clearly what is going on in a loop than the loop code itself. Global invariants that I can formulate for distributed system state can allow me to express more clearly how the protocols I design manage to avoid deadlock than the code itself. So the main value if the hand-in-hand method, as a code design method, is that you can develop these clarifying mathematical viewpoints, even before you start writing the code. (Not all programmers are necessarily good at using formal methods to clarify what they are doing, or to accelerate their work. I have heard a formal tool vendor say that formal methods based tools only make the top 10% of programmers in any development organisation more productive.)

On the specification phase: formal methods can allow me to express safety properties I seek for AGI systems in a much more clear and precise way than any piece of code can, or any natural language statement can.

If we had such formal methods backed approaches, I could see DeepMind potentially using it, maybe OpenAI, but not chinese labs or all the competitors we don't know about. And the fact that it would slow down whoever uses it for 2 years would be a strong incentive for even the aligned places to not use it

No, the the design flow here is that alignment researchers spend these 2 years of time using formal methods to develop and publish AGI safety mechanisms right now, before anybody invents viable AGI. The designs we develop should then preferably take far less than 2 years, and no deep mathematical skills, to combine with an AGI-level ML algorithm, if one ever invents one.

OK, I really need to post the promised list of list of alignment papers were formal methods are actually used in the clarifying design-tool way I describe above. Talking about it in the abstract is probably not the best way to get my points across. (I am in the middle of attending virtual NeurIPS, so I have not found the time yet to compile the promised list of papers.)

BTW, bit of a side issue: I do not have an x-risk fault tree model where I assume that a 'chinese lab' is automatically going to be less careful than a 'western lab', But as a rhetorical shorthand I see what you are saying.

Comment by Koen.Holtman on Stop button: towards a causal solution · 2021-12-06T21:40:47.525Z · LW · GW

I just ran across this post and your more recent post. (There are long time intervals when I do not read LW/AF.) Some quick thoughts/pointers:

I haven't properly followed AI safety for a while. I don't know if this idea is original.

On the surface, this looks very similar to Armstrong's indifference methods and/or Counterfactual Planning. Maybe also related to LCDT.

However, I cannot really tell how similar it is to any of these, because I can't follow your natural language+mathematical descriptions of how exactly your construct your intended counterfactuals. When I mean 'can't follow ', I mean I do not understand your description of your design enough that I could actually implement your method in a simple ML system in a simple toy world.

This lack of exhaustive detail is a more common problem when counterfactuals are discussed here, but it can be solved. If you want to get some examples of how to write detailed and exhaustive explanations of counterfactual stop button designs and how they relate to machine learning, see here and here.

Comment by Koen.Holtman on On Solving Problems Before They Appear: The Weird Epistemologies of Alignment · 2021-12-06T15:53:46.001Z · LW · GW

I think I will need to write two separate replies to address the points you raise. First, a more meta-level and autobiographical reply.

When it comes to formal methods, I too have a good idea of what I am talking about. I did a lot of formal methods stuff in the 1990 at Eindhoven University, which at the time was a one of the places with the most formal methods in the Netherlands.

I also see that you're saying formal methods help you design stuff. That sounds very wrong to me.

When I talk about using formal methods to design stuff, I am very serious. But I guess this is not the role of formal methods you are used to, so I'll elaborate.

The type of formal methods use I am referring to was popular in the late 1970s to at least the 1990s, not sure how popular it is now. It can be summarized by Dijkstra's slogan of “designing proof and program hand in hand”.

This approach stands in complete opposite to the approach of using a theorem prover to verify existing code after it was written.

In simple cases, the designing hand-in-hand approach works as follows: you start with a mathematical specification of the properties you want the program to be written to have. Then you use this specification to guide both the writing of the code and the writing of the correctness proof for the code at the same time. The writing of the next line in the proof will often tell you exactly what next lines of code you need to add. This often leads to code which much more clearly expresses what is going on. The whole code and proof writing process can often be done without even leveraging a theorem prover.

In more complex cases, you first have to develop the mathematical language to write the specification in. These complex cases are of course the more fun and interesting cases. AGI safety is one such more complex case.

You mention distributed computing. This is one area where the hand-in-hand style of formal methods use is particularly useful, because the method of intuitive trial and error sucks so much at writing correct distributed code. Hand-in-hand methods are useful if you want to design distributed systems protocols that are proven to be deadlock-free, or have related useful properties. (The Owicki-Gries method was a popular generic approach for this type of thing in the 1990, not sure if it is still mentioned often.)

In the late 1990s, I considered the career option of joining academia full-time to work on formal methods. But in the end, I felt that the academic field was insufficiently interested in scaling up beyond toy problems, while there was also insufficient demand-pull from industry or industry regulators. (This never advancing beyond toy problems is course a common complaint about academic CS, it does not usually pay in academia to work on problems larger than the page limit of a conference paper.) In any case, in the late 1990s I abandoned the idea of a career in the formal methods field. Instead, I have done various other things in internet engineering, big science, and industry R&D. I do feel that in all these cases, my formal methods background often often helped me design stuff better, and write more clear system specifications.

The reason I got into independent AGI safety research a few years ago is because a) after working in industry for a long time, I had enough money that I could spend some time doing whatever I liked, including writing papers without caring about a page limit, and b) (more very relevant to this discussion) after scanning various fun open research problems, I felt that the open problems in AGI safety were particularly tractable to the formal methods approach. Specifically, to the approach where you construct programs and proofs hand in hand without even bothering with theorem provers.

So compared to you, I have the exact opposite opinion on about the applicability of formal methods to AGI safety.

I intend to back up this opinion by giving you a list of various formal methods based work which has advanced the field of AGI safety, but I'll do that in a later reply.

And providing a specification for alignment would be on par with proving P != NP. Not saying it's impossible, just that it's clearly so hard that you nee to show a lot of evidence to convince people that you did. (I'm interested if you feel you have evidence for that ^^)

The key to progress is that you specify various useful 'safety properties' only, which you then aim to construct/prove. Small steps! It would be pretty useless to try to specify the sum total of all current and future human goals and morality in one single step.

Sure, you can use formal methods if you want to test your spec, but it's so costly with modern software that basically no one considers it worth it except cases where there's a risk of death (think stuff like airplanes).

This is of course the usual Very American complaint about formal methods.

However, if I compare AGI to airplanes, I can note that, compared to airplaine disasters, potential AGI disasters have an even bigger worst-case risk of producing mass death. So by implication, everybody should consider it worth the expense to apply formal methods to AGI risk management. In this context, 2 person years to construct a useful safety proof should be considered money well spent.

Finding the money is a somewhat separate problem.

And the big problem with AGI is that a lot of the people running for it don't understand that it involves risks of death and worse.

I don't agree with the picture you are painting here. The companies most obviously running for AGI, DeepMind and OpenAI, have explicit safety teams inside of them and they say all the right things about alignment risks on their websites. Also, there is significant formal methods based work coming out of DeepMind, from people like Hutter and Everitt.

Comment by Koen.Holtman on On Solving Problems Before They Appear: The Weird Epistemologies of Alignment · 2021-12-05T16:16:41.303Z · LW · GW

Nice overview, but what surprises me is that you are not in fact describing the main epistemic strategy used in engineering. What you say about engineering is:

[...] tinkering is a staple strategy in engineering, before we know how to solve the problem things reliably. Think about curing cancer or building the internet: you try the best solutions you can think of, see how they fail, correct the issues or find a new approach, and iterate.

You fail to mention the more important engineering strategy: one which does not rely on tinkering, but instead on logical reasoning and math to chart a straight line to your goal.

To use the obvious example, modern engineering does not design bridges by using the staple strategy of tinkering, it will use applied math and materials science to create and validate the bridge design.

From this point of view, the main difference between 'science' and 'engineering' is that science tries to understand nature: it seeks to understand existing physical laws and evolved systems. But engineering builds systems from the ground up. The whole point of this is that the engineer can route around creating the kind of hard-to-analyse complexity you often encounter in evolved systems. Engineering is a 'constructive science' not an 'observational science' or 'experimental science'.

On this forum, you can find a lot of reflection and debate about 'the fundamental properties of AGI', debate which treats AGI as some kind of evolved entity, not as a designed entity. I feel that this approach of treating AGI as an evolved entity is too limited. It may be a useful framing if you want to prove that the alignment problem exists, but it will not help you much in solving it.

In software engineering, the 'tinkering' approach to programming is often described as the approach of 'debugging a blank piece of paper'. This kind-of-experimental approach is not the recommended or respectable approach, especially not for high-risk software systems. (That being said, in ML algorithm research, improving systems by blindly tinkering with them still seems to be fairly respectable.)

Like you, I like Theoretical Computer Science. However, I do not like or recommend complexity theory as a model of TCS. As complexity theory mainly showcases the analytical post-hoc use of math, and not the constructive use of math to guide your design decisions. The type of TCS I like is what Wikipedia calls formal methods:

The use of formal methods for software and hardware design is motivated by the expectation that, as in other engineering disciplines, performing appropriate mathematical analysis can contribute to the reliability and robustness of a design.

Specifically, in AI/AGI alignment, we can trivially use formal methods to model superintelligent machine learning, and then proceed to use formal methods to construct agents that have certain provable safety properties, even though they incorporate superintelligent machine learning.

Stuart Russell has remarked here that in the US, there seems to be somewhat of a cultural aversion against using formal methods:

Intel’s Pentium chip was tested with billions of examples of multiplication, but it failed to uncover a bug in the multiplication circuitry, which caused it to produce incorrect results in some cases. And so we have a technology of formal verification, which would have uncovered that error, but particularly in the US there’s a culture that’s somewhat opposed to using formal verification in software design.

Less so in hardware design nowadays, partly because of the Pentium error, but still in software, formal verification is considered very difficult and very European and not something we do.

I'd like to emphasize the use of a constructive engineering mindset, and the use of formal methods from TCS, as important epistemic strategies applicable to AGI alignment.

Comment by Koen.Holtman on Solve Corrigibility Week · 2021-12-05T14:13:42.430Z · LW · GW

Yes, by calling this site a "community of philosophers", I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.

You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, like you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of the above three examples represent some form of progress, but you and I are not the entire community here, Yudkowsky is also part of it.

On the last one of your three examples, I feel that 'mesa optimizers' is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not like the people posting on this site are incapable of using the tools needed to crisply define things, the problem is that many do not seem very interested in ever using other people's definitions or models as a frame of reference. They'd rather free-associate on the term, and then develop their own strongly held beliefs of what it is all supposed to be about.

I am sensing from your comments that you believe that, with more hard work and further progress on understanding alignment, it will in theory be possible to make this community agree, in future, that certain alignment problems have been solved. I, on the other hand, do not believe that it is possible to ever reach that state of agreement in this community, because the debating rules of philosophy apply here.

Philosophers are always allowed to disagree based on strongly held intuitive beliefs that they cannot be expected to explain any further. The type of agreement you seek is only possible in a sub-community which is willing to use more strict rules of debate.

This has implications for policy-related alignment work. If you want to make a policy proposal that has a chance of being accepted, it is generally required that you can point to some community of subject matter experts who agree on the coherence and effectiveness of your proposal. LW/AF cannot serve as such a community of experts.

Comment by Koen.Holtman on Soares, Tallinn, and Yudkowsky discuss AGI cognition · 2021-12-02T13:38:24.207Z · LW · GW

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

Seems you are mostly considering solution (1) above, except in the last paragraph where you consider a somewhat special version if (2). I believe that Eliezer is saying in the discussion above that solution (1) is a lot more difficult than some people proposing it seem to think. He could be nicer about how he says it, but overall I tend to agree.

In my own alignment work I am mostly looking at solution (2), specifically to create a game-theoretical setup where the agent has a reduced, hopefully even non-existent, motivation to ever manipulate humans. This means you look for a solution where you make interventions on the agent environment, reward function, or other design elements, not on the agent ML system.

Modern mainstream ML research of course almost never considers the design or evaluation of such non-ML-system interventions.

Comment by Koen.Holtman on Soares, Tallinn, and Yudkowsky discuss AGI cognition · 2021-12-02T13:08:15.097Z · LW · GW

Of course there has been lots of 'obvious output of this kind from the rest of the "AI safety" field'. It is not like people have been quiet about convergent instrumental goals. So what is going on here?

I read this line (and the paragraphs that follow it) as Eliezer talking smack about all other AI safety researchers. As observed by Paul here:

Eliezer frequently talks smack about how the real world is surprising to fools like Paul

I liked some of Eliezer's earlier, more thoughtful writing better.

Comment by Koen.Holtman on Solve Corrigibility Week · 2021-12-02T11:03:16.267Z · LW · GW

Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement.

Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress.

but maybe I'm missing something?

The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site have absolutely no mechanism for agreeing among themselves whether a problem has been solved, or whether some sub-group has made meaningful progress on it.

To make the same point in another way: the forces which introduce disagreeing viewpoints and linguistic entropy to this forum are stronger than the forces that push towards agreement and clarity.

My thinking about how strong these forces are has been updated recently, by the posting of a whole sequence of Yudkowsky conversations and also this one. In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.

I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.

Comment by Koen.Holtman on Solve Corrigibility Week · 2021-11-30T11:02:17.325Z · LW · GW

I don't feel like joining this, but I do wish you luck, and I'll make a high level observation about methodology.

I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better.

I don't consider myself to be a rationalist or EA, but I do post on this web site, so I guess this makes me part of the community of people who post on this site. My high level observation on solving corrigibility is this: the community of people who post on this site have absolutely no mechanism for agreeing among themselves whether a problem has been solved.

This is what you get when a site is in part a philosophy-themed website/forum/blogging platform. In philosophy, problems are never solved to the satisfaction of the community of all philosophers. This is not necessarily a bad thing. But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

In business, there is the useful terminology that certain meetings will be run as 'decision making meetings', e.g. to make a go/no-go decision on launching a certain product design, even though a degree of uncertainty remains. Other meetings are exploratory meetings only, and are labelled as such. This forum is not a decision making forum.

Comment by Koen.Holtman on How To Get Into Independent Research On Alignment/Agency · 2021-11-28T12:06:26.541Z · LW · GW

I'm aware that a lot of AI Safety research is already of questionable quality. So my question is: how can I determine as quickly as possible whether I'm cut out for this?

My key comment here is that, to be an independent researcher, you will have to rely day-by-day on your own judgement on what has quality and what is valuable. So do you think you have such judgement and could develop it further?

To find out, I suggest you skim a bunch of alignment research agendas, or research overviews like this one, and then read some abstracts/first pages of papers mentioned in there, while trying apply your personal, somewhat intuitive judgement to decide

• which agenda item/approach looks most promising to you as an actual method for improving alignment

• which agenda item/approach you feel you could contribute most to, based on your own skills.

If your personal intuitive judgement tells you nothing about the above questions, if it all looks the same to you, then you are probably not cut out to be an independent alignment researcher.

Comment by Koen.Holtman on Ngo and Yudkowsky on alignment difficulty · 2021-11-25T19:07:34.097Z · LW · GW

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?

Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer, you can decide if you want to design an agent which indeed uses this latent ability to 'look at itself'.

Once you decide that you don't want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.

Artificial general intelligence (AGI) is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can.

Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.

this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

Terminology note if you want to look into this some more: ML typically does not frame this goal as 'instructing the model not to learn about Q'. ML would frame this as 'building the model to approximate the specific relation between some well-defined observables, and this relation is definitely not Q'.

Comment by Koen.Holtman on Ngo and Yudkowsky on alignment difficulty · 2021-11-24T10:32:17.522Z · LW · GW

Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.

When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function that values the maximization of staples instead.

To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that if can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it loose money.

Comment by Koen.Holtman on How To Get Into Independent Research On Alignment/Agency · 2021-11-23T14:36:20.433Z · LW · GW

As nobody else has mentioned it yet in this comment section: AI Safety Support is a resource-hub specifically set up to help people get into alignment research field.

I am a 50 year old independent alignment researcher. I guess I need to mention for the record that I never read the sequences, and do not plan to. The piece of Yudkowsky writing that I'd recommend everybody interested in alignment should read is Corrigibilty. But in general: read broadly, and also beyond this forum.

I agree with John's observation that some parts of alignment research are especially well-suited to independent researchers, because they are about coming up with new frames/approaches/models/paradigms/etc.

But I would like to add a word of warning. Here are two somewhat equally valid ways to interpret LessWrong/Alignment Forum:

1. It is a very big tent that welcomes every new idea

2. It is a social media hang-out for AI alignment researchers who prefer to engage with particular alignment sub-problems and particular styles of doing alignment research only.

So while I agree with John's call for more independent researchers developing good new ideas, I need to warn you that your good new ideas may not automatically trigger a lot of interest or feedback on this forum. Don't tie your sense of self-worth too strongly to this forum.

On avoiding bullshit: discussion on this forum are often a lot better than on some other social media sites, but still Sturgeon's law applies.

Comment by Koen.Holtman on Ngo and Yudkowsky on alignment difficulty · 2021-11-22T16:02:44.757Z · LW · GW

10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

Not sure if a short answer will help, so I will write a long one.

In 10.2.4 I talk about the possibility of an unwanted learned predictive function that makes predictions without using the argument . This is possible for example by using together with a (learned) model of the compute core to predict : so a viable could be defined as . This could make predictions fully compatible with the observational record , but I claim it would not be a reasonable learned according to the reasonableness criterion . How so?

The reasonableness criterion is similar to that used in supervised machine learning: we evaluate the learned not primarily by how it matches the training set (how well it predicts the observations in ), but by evaluating it on a separate test set. This test set can be constructed by sampling to create samples not contained in . Mathematically, perfect reasonableness is defined as , which implies that predicts all samples from fully accurately.

Philosophically/ontologically speaking, an the agent specification in my paper, specifically the learning world diagram and the descriptive text around it of how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive machine learning of , but another version can be used stand-alone to construct a test set.

A sampling action to construct a member of the test set would set up a desired state and action , and then observe the resulting . Mathematically speaking, this observation gives additional information about the numeric value of and of all for all .

I discuss in the section that, if we take an observational record sampled from , then two learned predictive functions and could be found which are both fully compatible with all observations in . So to determine which one might be a more reasonable approximation of , we can see how well they would each predict samples not yet in .

In the case of section 10.2.4, the crucial experimental test showing that is an unreasonable approximation of is one where we create a test set by setting up an and an where we know that is an action that would definitely not be taken by the real compute core software running in the agent, when it it encounters state . So we set up a test where we expect that . will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that is a correct theory of .

As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration, where random exploration causes the observational record , the training set, to already contain observations where for any deterministic . So this will likely suppress the creation of an unwanted via machine learning.

Some background: the symbol grounding issue I discuss in 10.2.4 is very related to the five-and-ten problem you can find in MIRI's work on embedded agency. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.

Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper's view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.

I would be interested to know if the above explanation was helpful to you, and if so which parts.

Comment by Koen.Holtman on Corrigibility Can Be VNM-Incoherent · 2021-11-21T18:25:53.690Z · LW · GW

but it's always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty

If you do not know it already, this intuition lies at the heart of CIRL. So before you jump to coding, my recommendation is to read that paper first. You can find lots of discussion on this forum and elsewhere on why CIRL is not a perfect corrigibility solution. If I recall correctly, the paper itself also points out the limitation I feel is most fundamental: if uncertainty is reduced based on further learning, CIRL-based corrigibility is also reduced.

There are many approaches to corrigibility that do not rely on the concept of reward uncertainty, e.g. counterfactual planning and Armstrong's indifference methods.

Comment by Koen.Holtman on Corrigibility Can Be VNM-Incoherent · 2021-11-21T18:14:31.279Z · LW · GW

(I already commented on parts of this post in this comment elsewhere, the first and fourth paragaph below copy text from there.)

My first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' an intuituon-pump, in a meaning where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.

While are using math to disambiguate some properties of corrigibility above (yay!), you are not necessarily disambiguating Eliezer.

Maybe I am reading your post wrong: I am reading it as an effort to apply the axioms of VNM-rationality to define a notion you call VNM-incoherence. But maybe VN and M defined a notion of coherence not related to their rationality axioms. a version of coherence I cannot find on the Wikipedia page -- if so please tell me.

I am having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can also examine/score the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case.

When it comes to a multi-time-step agent, I guess there are two ways to interpret the notion of 'outcome' in VNM theory: the outcome is either the system state obtained after the last time step, or the entire observable trajectory of events over all time steps.

As for what you prove above, I would phrase the statement being proven as follows. If you want to force a utility-maximising agent to adopt a corrigible policy by defining its utility function, then it is not always sufficient to define a utility function that evaluates the final state along its trajectory only. The counter-example given shows that, if you only reference the final state, you cannot construct a utility function that will score and differently.

The corollary is: if you want to create a certain type of corrigibility via terms you add to the utility function of a utility-maximising agent, you will often need to define a utility function that evaluates the entire trajectory, maybe including the specific actions taken, not just the end state. The default model of an MDP reward function, the one where the function is applied to each state transition along the trajectory, will usually let you do that. You mention:

I don't think this is a deep solution to corrigibility (as defined here), but rather a hacky prohibition.

I'd claim that you have proven that you actually might need such hacky prohibitions to solve corrigibility in the general case.

To echo some of the remarks made by tailcalled: maybe this is not surprising, as human values are often as much about the journey as about the destination. This seems to apply to corrigibility. The human value that corrigibility expresses does not in fact express a preference ordering on the final states an agent will reach: on the contrary it expresses a preference ordering among the methods that the agent will use to get there.

Comment by Koen.Holtman on Ngo and Yudkowsky on alignment difficulty · 2021-11-21T14:51:34.243Z · LW · GW

Read your post, here are my initial impressions on how it relates to the discussion here.

In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.

However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' an intuition-pump way where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.

Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can examine the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case. This has certain (fairly trivial) corollaries about building corrigibility. I'll expand on this in a comment I plan to attach to your post.

I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper where I develop a somewhat similar clarifying toy environment, with running code.

Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard see how to apply my approach, because the depiction mashes up the state of the world outside of the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have a an agent design that can compute the balancing function terms according to Armstrong's indifference approach, you need a different depiction of your setup. You need one which separates out these two state components more explicitely. For example, make an MDP model where the individual states are instances of the tuple (physical position of agent in the ABC playing field,policy function loaded into the compute core).

Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.

Comment by Koen.Holtman on Ngo and Yudkowsky on alignment difficulty · 2021-11-19T22:39:37.852Z · LW · GW

we can't pick out the compute core in the black-box learned model.

Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.

But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.

I don't understand your second paragraph 'And my Eliezer's problem...'. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.

Comment by Koen.Holtman on Ngo and Yudkowsky on alignment difficulty · 2021-11-19T21:07:21.009Z · LW · GW

See above for my reply to Eliezer.

Indeed, a counterfactual planner will plan coherently inside its planning world.

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function that values the maximization of staples instead. In section 6.3 of the paper I show that you can perfectly well interpret a counterfactual planner as an agent that plans coherently even inside its learning world (inside the real world), as long as you are willing to evaluate its coherency according to the somewhat strange reward function . Armstrong's indifference methods use this approach to create corrigibility without losing coherency: they construct an equivalent somewhat strange reward function by including balancing terms.

One thing I like about counterfactual planning is that, in my view, it is very interpretable to humans. Humans are very good at predicting what other humans will do, when these other humans are planning coherently inside a specifically incorrect world model, for example in a world model where global warming is a hoax. The same skill can also be applied to interpreting and anticipating the actions of AIs which are counterfactual planners. But maybe I am misunderstanding your concern about interpretability.

Comment by Koen.Holtman on Ngo and Yudkowsky on alignment difficulty · 2021-11-19T20:38:57.570Z · LW · GW

Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how,

in the least fancy actually-precise terms of which your writing talent permits?

If you want actual full precision, I have to refer you to the math in my papers. Since 2019 I have been working on and off to make this math more general and accessible, and to find better ways to explain this math in actually-precise natural language statements. So here is my best current attempt.

TL;DR: The coherence constraint/theorem/property that I want to violate is the property that the AGI is using a world model that accurately depicts the internals of its own compute core. I want to make the agent use an inaccurate model of these internals, one that omits the safety shut-down mechanisms I have built inside of its core. By doing this, I can reduce the emergent incentive of the AGI agent to disable its own emergency shut-down mechanisms, and the emergent incentive to stop people from activating them.

I will now expand on this and add more details, using the the example of an emergency stop button.

Say that the compute core has an emergency stop button attached to it. Say that actual software running inside the compute core will, when receiving a stop signal from the button, cause the agent to stop. When the signal is received, the software will always select and perform actions in every future time step. Let's say that is a world model that accurately depicts this situation. I am not going to build an AGI that uses to plan its actions.

Instead I build an AGI agent that will plan its next actions by using an incorrect world model . This is different from , but only in how it depicts the internals of the agent compute core. In the incorrect/imaginary world depicted by , the compute core has different software in it, software that will ignore the stop button signal, and just keep on picking actions that maximize utility. I further construct my AGI so that, in every time step, it calculates which next action would maximize utility in this incorrect, imaginary world . I then further construct it to take this same action in the real world.

An AGI that lives in a world that matches the correct model , while using the incorrect model to decide what actions to take, will (sometimes) act incoherently. But this is a very specific and targeted form of incoherence, far less severe and far less dangerous than having the agent believe that . Still, this incoherency does have some unwanted side effects. One example side effect, also considered in the 2015 MIRI corrigibility paper, is that the AGI will be indifferent about the fate of its emergency stop button. In my papers, I explore a much broader range of side effects, and look at how to prevent/suppress/fix them.

So the first step in the above corrigibility design is that you have to imagine constructing an AGI that will use a specifically incorrect world model . The first step is that you have to realize that you actually want this, that you want to decrease coherence in order to increase corrigibility.

The next step is to look for options to actually construct an AGI that uses this incorrect .

1. One option is to train the AGI in a simulated environment that faithfully represents . We then extract the learned world model, and put it in an AI that we unleash in the real world. We disable further learning in this AI, to prevent from being updated towards . Though this option depicts exactly how many current real-life AI systems are constructed and get their corrigibility, this option is not very satisfactory in a theoretical sense. It may not be a viable path to build certain otherwise desirable types of AI/AGI/superintelligence.

2. The more interesting option is to build an AGI that will construct based on an online training regime in the real world, in a real world that exactly matches the correct world model .

Is the second option technically feasible? The answer is generally yes.

A very general solution, leveraging causal models, is in this sequence. To give you some idea of the more detailed math involved: the (ls) pictured below is the real/correct world model , and (p) is the imaginary/incorrect world model . In the agent defined using these pictures, certain indifference properties are present trivially, by construction.

The solution in the pictures above uses ML to learn a model that correctly models only a part of the real world, the part of the real world outside of the agent compute core. We then slot into a causal world model, one that has other parts which mis-represent the real compute core internals. This causal world model will then be used as the incorrect . Note that this learned may be a fully opaque deep neural net, or whatever black-box thing the next generation of ML might produce. We do not have to understand or edit any details inside for this slot-in operation to work.

(I should mention that the paper has a lot of details not mentioned in the sequence, or visible in the pictures above. In particular, section 10.2 may be of interest.)

I want to stress that this causal model option is only one possible route to creating incorrect world models via machine learning in the real world. Papers like Safely interruptible agents and How RL Agents Behave When Their Actions Are Modified show that the idea of removing certain events from the training record can also work: whether this works as intended depends on having the right built-in priors, priors which control inductive generalization.

So overall, I have a degree of optimism about AGI corrigibility.

That being said, if you want to map out and estimate probabilities for our possible routes to doom, then you definitely need to include the scenario where a future superior-to-everything-else type of ML is invented, where this superior future type of ML just happens to be incompatible with any of the corrigibility techniques known at that time. Based on the above work, I put a fairly low probability on that scenario.

Comment by Koen.Holtman on Ngo and Yudkowsky on alignment difficulty · 2021-11-18T18:20:24.542Z · LW · GW

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.

Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.

Looks like Eliezer believes that (or in Bayesian terms, assigns a high probability to the belief that) corrigibility has not been solved for AGI. He believes it has not been solved for any practically useful value of solved. Furthermore it looks like he expects that progress on solving AGI corrigibility will be slower than progress on creating potentially world-ending AGI. If Eliezer believed that AGI corrigibility had been solved or was close to being solved, I expect he would be in a less dark place than depicted, that he would not be predicting that stolen/leaked AGI code will inevitably doom us when some moron turns it up to 11.

In the transcript above, Eliezer devotes significant space to explaining why he believes that all corrigibility solutions being contemplated now will likely not work. Some choice quotations from the end of the transcript:

[...] corrigibility is anticonvergent / anticoherent / actually moderately strongly contrary to and not just an orthogonal property of a powerful-plan generator.

this is where things get somewhat personal for me:

[...] (And yes, people outside MIRI now and then publish papers saying they totally just solved this problem, but all of those "solutions" are things we considered and dismissed as trivially failing to scale to powerful agents - they didn't understand what we considered to be the first-order problems in the first place - rather than these being evidence that MIRI just didn't have smart-enough people at the workshop.)

I am one of `these people outside MIRI' who have published papers and sequences saying that they have solved large chunks of the AGI corrigibility problem.

I have never been claiming that I 'totally just solved corrigibility'. I am not sure where Eliezer is finding these 'totally solved' people, so I will just ignore that bit and treat it as a rhetorical flourish. But I have indeed been claiming that significant progress has been made on AGI corrigibility in the last few years. In particular, especially in the sequence, I implicitly claim that viewpoints have been developed, outside of MIRI, that address and resolve some of MIRIs main concerns about corrigibility. They resolve these in part by moving beyond Eliezer's impoverished view of what an AGI-level intelligence is, or must be.

Historical note: around 2019 I spent some time trying to get Eliezier/MIRI interested in updating their viewpoints on how easy or hard corrigibility was. They showed no interest to engage at that time, I have since stopped trying. I do not expect that anything I will say here will update Eliezer, my main motivation to write here is to inform and update others.

I will now point out a probable point of agreement between Eliezer and me. Eliezer says above that corrigibility is a property that is contradictory to having a powerful coherent AGI-level plan generator. Here, coherency has something to do with satisfying a bunch of theorems about how a game-theoretically rational utility maximiser must behave when making plans. One of these theorems is that coherence implies an emergent drive towards self-preservation.

I generally agree with Eliezer that there is a indeed a contradiction here: there is a contradiction between broadly held ideas of what it implies for an AGI to be a coherent utility maximising planner, and broadly held ideas of what it implies for an AGI to be corrigible.

I very much disagree with Eliezier on how hard it is to resolve these contradictions. These contradictions about corrigibility are easy to resolve one you abandon the idea that every AGI must necessarily satisfy various theorems about coherency. Human intelligence definitely does not satisfy various theorems about coherency. Almost all currently implemented AI systems do not satisfy some theorems about coherency, because they will not resist you pressing their off switch.

So this is why I call Eliezer's view of AGI an impoverished view: Eliezer (at least in the discussion transcript above, and generally whenever I read his stuff) always takes it as axiomatic that an AGI must satisfy certain coherence theorems. Once you take that as axiomatic, it is indeed easy to develop some rather negative opinions about how good other people's solutions to corrigibility are. Any claimed solution can easily be shown to violate at least one axiom you hold dear. You don't even need to examine the details of the proposed solution to draw that conclusion.

Comment by Koen.Holtman on Collection of arguments to expect (outer and inner) alignment failure? · 2021-10-09T11:21:41.588Z · LW · GW

Not really, unfortunately. In those posts [under the threat models tag], the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place.

I feel that Christiano's post here is pretty good at identifying plausible failure modes inside society that lead to unaligned agents not being corrected. My recollection of that post is partly why I mentioned the posts under that tag.

There is an interesting question of methodology here: if you want to estimate the probability that society will fail in this this way in handing the impact of AI, do you send a poll to a bunch of AI technology experts, or should you be polling a bunch of global warming activists or historians of the tobacco industry instead? But I think I am reading in your work that this question is no news to you.

Several of the AI alignment organisations you polled have people in them who produced work like this examination of the nuclear arms race. I wonder what happens in your analysis of your polling data if you single out this type of respondent specifically. In my own experience in analysing polling results with this type of response rate, I would be surprised however if you could find a clear signal above the noise floor.

However [...] it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.

Agree, that is why I am occasionally reading various posts with failure scenarios and polls of experts. To be clear: my personal choice of alignment research subjects is only partially motivated by what I think is the most important to work to do, if I want to have the best chance of helping. Another driver is that I want to have some fun with mathematics. I tend to work on problems which lie in the intersection of those two fuzzy sets.

Comment by Koen.Holtman on Safety-capabilities tradeoff dials are inevitable in AGI · 2021-10-08T14:34:41.627Z · LW · GW

Based on what you say above, I do not think we fundamentally disagree. There are orthogonal dimensions to safety mechanism design which are all important.

I somewhat singled out your line of 'the lower the better' because I felt that your taxation framing was too one-dimensional.

There is another matter: in US/UK political discourse, it common that if someone wants to prevent the government from doing something useful, this something will be framed as a tax, or as interfering with economic efficiency. If someone does want the government to actually do a thing, in fact spend lavishly on doing it, the same thing will often be framed as enforcement. This observation says something about the quality of the political discourse. But as a continental European, it is not the quality of the discourse I want to examine here, only the rhetorical implications.

When you frame your safety dials as taxation, then rhetorically you are somewhat shooting yourself in the foot, if you want proceed by arguing that these dials should not be thrown out of the discussion.

When re-framed as enforcement, the cost of using these safety dials suddenly does not sound as problematic anymore.

But enforcement, in a way that limits freedom of action, is indeed a burden to those at the receiving end, and if enforcement is too heavy they might seek to escape it altogether. I agree that perfectly inescapable watertight enforcement is practically nonexistent in this world, in fact I consider its non-existence to be more of a desirable feature of society than it is a bug.

But to use your terminology, the level of enforcement applied to something is just one of these tradeoff dials that stink. That does not mean we should throw out the dial.

Comment by Koen.Holtman on Collection of arguments to expect (outer and inner) alignment failure? · 2021-10-08T13:02:51.282Z · LW · GW

Meta: I usually read these posts via the alignmentforum.org portal, and this portal filters out certain comments, so I missed your mention of abergal's suggestion, which would have clarified your concerns about inner alignment arguments for me. I have mailed the team that runs the website to ask if they could improve how this filtering works.

Just read the post with the examples you mention, and skimmed the related arxiv paper. I like how the authors develop the metrics of 'objective robustness' vs 'capability robustness' while avoiding the problem of trying to define a single meaning for the term 'inner alignment'. Seems like good progress to me.

Comment by Koen.Holtman on Safety-capabilities tradeoff dials are inevitable in AGI · 2021-10-08T11:46:46.628Z · LW · GW

I agree with your central observation that Safety-capabilities tradeoff dials are inevitable in AGI. It is useless to search only a safety mechanism that all purely selfish owner of AGIs will use voluntarily, at a safe setting, even when such selfish owners are engaged in an arms race.

However, I disagree with another point you make:

Then the next question to ask is: if the alignment tax is less than infinity, that’s a good start, but just how high or low is the tax? There’s no right answer anymore: it’s just “The lower the better”.

The right answer is definitely not 'the lower the better'. I'd say that your are framing cost of the alignment tax as 'the tax is higher if it gets more in the way of selfish peoples' desire to fully pursue their selfish ends' . In this, you are following the usual 'alignment mechanisms need to be commercially viable' framing on this forum. But with this limited framing, lower alignment taxes will not necessarily imply better outcomes for all of society.

What you need to do is to examine the blank in your third bullet point above more carefully, the blank line you expect the AGI strategy / governance folks to hopefully fill in. What these AGI strategy / governance folks are looking for is clear: they want to prevent destructive arms races, and races-to-the-bottom-of-safety, from happening between somewhat selfish actors, where these actors may be humans or organisations run by humans. This prevention can only happen is there is a countervailing force to all this selfishness.

(The alternative to managing a countervailing force would be to put every selfish human in a re-education camp, and hope that it works. Not the kind of solution that the strategy/governance folks are looking for, but I am sure there is an SF novel somewhere about a re-education camp run by an AGI. Another alternative would be to abandon the moral justification for the market system and markets altogether, abandon the idea of having any system of resource allocation where self-interested actions can be leveraged to promote the common good. Again, not really what most governance folks are looking for,)

Strategy/governance folks are not necessarily looking for low-cost dials as you define them, they are looking for AGI safety dials that they can all force selfish actors to use, as part of the law or of a social contract (where a social contract can be between people but also between governments or government-shaped entities). To enforce the use of a certain dial at a certain setting, one must be able to detect if an actor is breaking the law or social contract by not using the dial, so that sanctions can be applied. The easier and more robust this detection process is, the more useful the dial is.

The cost of being forced to use the dial to the selfish actors is only a secondary consideration. The less self-aware of these selfish actors can be relied on to always complain about being bound to any law or social contract whatsoever.

Comment by Koen.Holtman on A brief review of the reasons multi-objective RL could be important in AI Safety Research · 2021-10-07T10:50:22.491Z · LW · GW

Thanks for writing this up! I support your call for more alignment research that looks more deeply at the structure of the objective/reward function. In general I feel that the reward function part of the alignment problem/solution space could use much more attention, especially because I do not expect traditional ML research community to look there.

Traditional basic ML research tends to abstract away from the problem of writing am aligned reward function: it all about investigating improvements to general-purpose machine learning, machine learning that can optimize for any possible 'black box' reward function .

In the work you did, you show that this black box view of the reward function is too narrow. Once you open up the black box and treat the reward function as a vector, you can define additional criteria about how machine learning performance can be aligned or unaligned.

In general, I found that once you take the leap and start contemplating reward function design, certain problems of AI alignment can become much more tractable. To give an example: the management of self-modification incentives in agents becomes kind of trivial if you can add terms to the reward function which read out some physical sensors, see for example section 5 of my paper here.

So I have been somewhat puzzled by the question of why there is so little alignment research in this direction, or why so few people step up an point out that this kind of stuff is trivial. Maybe this is because improving the reward function is not considered to be a part of ML research. If I try to manage self-modification incentives with my hands tied behind my back, without being allowed to install physical sensors coupled to reward function terms, the whole problem becomes much less tractable. Not completely intractable, but the the solutions I then find (see this earlier paper ) are mathematicaly much more complex, and less robust under mistakes of machine learning.

I sometimes have the suspicion that there are whole non-ML conferences or bodies of literature devoted to alignment related reward function design, but I am just not seeing them. Unfortunately, it looks like the modem2021 workshop website with the papers you linked to is currently down. It was working two weeks ago.

So a general literature search related question: while doing your project, did you encounter any interesting conferences or papers that I should be reading, if I want to read more work on aligned reward function design? I have already read Human-aligned artificial intelligence is a multiobjective problem.

Comment by Koen.Holtman on AI learns betrayal and how to avoid it · 2021-10-05T20:57:09.897Z · LW · GW

Looks interesting and ambitious! But I am sensing some methodological obstacles here, which I would like to point out and explore. You write:

Within those projects, I'm aiming to work on subprojects that are:

1. Posed in terms that are familiar to conventional ML;

2. interesting to solve from the conventional ML perspective;

3. and whose solutions can be extended to the big issues in AI safety.

Now, take the example of the capture the cube game from the Deepmind blog post. This is a game where player 1 tries to move a cube into the white zone. and player 2 tries to move it into the blue zone on the other end of the board. If the agents learn to betray each other here, how would you fix this?

We'll experiment with ways of motivating the agents to avoid betrayals, or getting anywhere near to them, and see if these ideas scale.

There are three approaches to motivating agents to avoid betrayals in capture the cube that I can see:

1. change the physical reality of the game: change the physics of the game world or the initial state of the game world

2. change the reward functions of the players

3. change the ML algorithms inside the players, so that they are no longer capable of finding the optimal betrayal-based strategy.

Your agenda says that you want to find solutions that are interesting from the conventional ML perspective. However, in the conventional ML perspective:

1. tweaking the physics of the toy environment to improve agent behavior is out of scope. It is close to cheating on the benchmark.

2. any consideration of reward function design is out of scope. Tweaking it to improve learned behavior is again close to cheating.

3. introducing damage into your ML algorithms so that they will no longer find the optimal policy is just plain weird, out of scope, and close to cheating.

So I'd argue that you have nowhere to move if you want to solve this problem while also pleasing conventional ML researchers. Conventional ML researchers will always respond by saying that your solution is trivial, problem-specific, and therefore uninteresting.

OK, maybe I am painting too much of a hard-core bitter lesson picture of conventional ML research here. I could make the above observations disappear by using a notion of conventional ML research that is more liberal in what it will treat as in-scope, instead of as cheating.

What I would personally find exiting would be a methodological approach where you experiment with 1) and 2) above, and ignore 3).

In the capture the cube game, you might experiment with reward functions that give more points for a fast capture followed by a fast move to a winning zone, which ends the game, and less for a slow one. If you also make this an iterated game (it may already be a de-facto iterated game depending on the ML setup), I would expect that you can produce robust collaborative behavior with this time-based reward function. The agents may learn to do the equivalent of flipping a coin at the start to decide who will win this time: they will implicitly evolve a social contract about sharing scarce resources.

You might also investigate a game scoring variant with different time discount factors, factors which more heavily or more lightly penalize wins which take longer to achieve. I would expect that with higher penalties for taking a longer time to win, collaborative behavior under differences between player intelligence and ability will remain more robust, because even a weaker player can always slow down a stronger player a bit if they want to. This penalty approach might then generalize to other types of games.

The kind of thing I have in mind above could also be explored in much more simple toy worlds than those offered by XLand. I have been thinking of a game where we drop two players on a barren planet, where one has the reward function to maximize paperclips, and one to maximize staples. If the number of paperclips and staples is discounted, e,g, the time-based reward functions are and , this might produce more collaborative/sharing behavior, and suppress a risky fight to capture total dominance over resources.

Potentially, some branch of game theory has already produced a whole body of knowledge that examines this type of approach to turning competitive games into collaborative games, and has come up with useful general results and design principles. Do not know. I sometimes wonder about embarking on a broad game theory literature search to find out. The methodological danger of using XLand to examine these game theoretical questions is that by spending months working in the lab, you will save hours in the library.

These general methodological issues have been on my mind recently. I have been wondering if AI alignment/safety researchers should spend less time with ML researchers and their worldview, and more time with game theory people.

I would be interested in your thoughts on these methodological issues, specifically your thoughts about how you will handle them in this particular subproject. One option I did not discuss above is transfer learning which primes the agents on collaborative games only, to then explore their behavior on competitive games.

Comment by Koen.Holtman on Collection of arguments to expect (outer and inner) alignment failure? · 2021-10-04T13:04:50.939Z · LW · GW

I also don't think [these three books] focus on surveying the range of arguments for alignment failure, but rather on presenting the author's particular view.

I disagree. In my reading. all of these books offer fairly wide-ranging surveys of alignment failure mechanisms.

A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is are in practice. Once we take it as axiomatic that some people are stupid some of the time, presenting a convincing proof that some AI alignment failure mode is theoretically possible does not require much heavy lifting at all.

If there are distilled collections of arguments with these properties, please let me know!

The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.

The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes.

There is a technical failure mode story which says that it is very difficult to equip a very powerful future AI with an emergency stop button, that we have not solved that technical problem yet. In fact, this story is a somewhat successful meme in its own right: it appears in all 3 books I mentioned. That story is not very compelling to me. We have plenty of technical options for building emergency stop buttons, see for example my post here.

There have been some arguments that none of the identified technical options for building AI stop buttons will be useful or used, because they will all turn out to be incompatible with yet-undiscovered future powerful AI designs. I feel that these arguments show a theoretical possibility, but I think it is a very low possibility, so in practice these arguments are not very compelling to me. The more compelling failure mode argument is that people will refuse to use the emergency AI stop button, even though it is available.

Many of the posts with the tag above show failure scenarios where the AI fails to be aligned because of an underlying weakness or structural problem in society. These are scenarios where society fails to take the actions needed to keep its AIs aligned.

One can observe hat that in recent history, society has mostly failed to take the actions needed to keep major parts of the global economy aligned with human needs. See for example the oil industry and climate change. Or the cigarette industry and health.

One can be a pessimist, and use our past performance on climate change to predict how good we will be in handling the problem of keeping powerful AI under control. Like oil, AI is a technology that has compelling short-term economic benefits. This line of thought would offer a very powerful 1-page AI failure mode argument. To a pessimist.

Or one can be an optimist, and argue that the case of climate change is teaching us all very valuable lessons, so we are bound to handle AI better than oil. So will you be distilling for an audience of pessimists or optimists?

There is a political line of thought, which I somewhat subscribe to, that optimism is a moral duty. This has kept me from spending much energy myself on rationally quantifying the odds of different failure mode scenarios. I'd rather spend my energy in finding ways to improve the odds. When it comes to the political sphere, a many problems often seem completely intractable, until suddenly there are not.

Comment by Koen.Holtman on Collection of arguments to expect (outer and inner) alignment failure? · 2021-10-04T11:05:43.755Z · LW · GW

I'll do the easier part of your question first:

I'm most interested in arguments for inner alignment failure. I'm pretty confused by the fact that some researchers seem to think inner alignment is the main problem and/or probably extremely difficult, and yet I haven't really heard a rigorous case made for its plausibility.

I have not read all the material about inner alignment that has appeared on this forum, but I do occasionally read up on it.

There are some posters on this forum who believe that contemplating a set of problems which are together called 'inner alignment' can work as an intuition pump that would allow us to make needed conceptual breakthroughs. The breakthroughs sought have mostly to do, I believe, with analyzing possibilities for post-training treacherous turns which have so far escaped notice. I am not (no longer) one of the posters who have high hopes that inner alignment will work as a useful intuition pump.

The terminology problem I have with the term 'inner alignment' is that many working on it never make the move of defining it in rigorous mathematics, or with clear toy examples of what are and what are not inner alignment failures. Absent either a mathematical definition or some defining examples, I am not able judge if inner alignment is either the main alignment problem, or whether it would be a minor one, but still one that is extremely difficult to solve.

What does not help here is that by now several non-mathematical notions floating around of what an inner alignment failure even is, to the extent that Evan has felt a need to write an entire clarification post.

When poster X calls something an example of an inner alignment failure, poster Y might respond and declare that in their view of inner alignment failure, it is not actually an example of an inner alignment failure, or a very good example of an inner alignment failure. If we interpret it as a meme, then the meme of inner alignment has a reproduction strategy where it reproduces by triggering social media discussions about what it means.

Inner alignment has become what Minsky called a suitcase word: everybody packs their own meaning into it. This means that for the purpose of distillation, the word is best avoided. If you want to distil the discussion, my recommendation is to look for the meanings that people pack into the word.

Comment by Koen.Holtman on Collection of arguments to expect (outer and inner) alignment failure? · 2021-10-03T15:56:01.553Z · LW · GW

This is probably not the answer you are looking for, but as you are considering putting a lot of work into this...

Does anyone know if this has been done? If not, I might try to make it.

Probably has been done, but depends on what you mean with strongest arguments.

Does strongest mean that the argument has a lot of rhetorical power, so that it will convince people that alignment failure is more plausible than it actually is? Or does strongest mean that it gives the audience the best possible information about the likelihood of various levels of misalignment, where these levels go from 'annoying but can be fixed' to 'kills everybody and converts all matter in its light cone to paperclips'.

Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.

My main message here, I guess, is that many distilled collections of arguments already exist, even book-length ones like Superintelligence, Human Compatible, and The Alignment Problem. If you are thinking about adding to this mountain of existing work, you need to carefully ask yourself who your target audience is, and what you want to convince them of.