Open positions: Research Analyst at the AI Standards Lab 2023-12-22T16:31:45.215Z
Demanding and Designing Aligned Cognitive Architectures 2021-12-21T17:32:57.482Z
Safely controlling the AGI agent reward function 2021-02-17T14:47:00.293Z
Graphical World Models, Counterfactuals, and Machine Learning Agents 2021-02-17T11:07:47.249Z
Disentangling Corrigibility: 2015-2021 2021-02-16T18:01:27.952Z
Creating AGI Safety Interlocks 2021-02-05T12:01:46.221Z
Counterfactual Planning in AGI Systems 2021-02-03T13:54:09.325Z
New paper: AGI Agent Safety by Iteratively Improving the Utility Function 2020-07-15T14:05:11.177Z
The Simulation Epiphany Problem 2019-10-31T22:12:51.323Z
New paper: Corrigibility with Utility Preservation 2019-08-06T19:04:26.386Z


Comment by Koen.Holtman on Shallow review of live agendas in alignment & safety · 2023-11-29T11:35:38.449Z · LW · GW

Thanks for reading my paper! For the record I agree with some but not all points in your summary.

My later paper 'AGI Agent Safety by Iteratively Improving the Utility Function' also uses the simulation environment with the and actions and I believe it explains the nature of the simulation a bit better by interpreting the setup more explicitly as a two-player game. By the way the and are supposed to be symbols representing arrows and for 'push # to later in time' and 'pull # earlier in time'.

The g_c agent does indeed satisfy desiderata 4; there's an incentive to preserve the shutdown mechanism; in fact, there's again an incentive to press the shutdown mechanism!

No, the design of the agent is not motivated by the need to create an incentive to preserve the shutdown button itself, as required by desideratum 4 from Soares et al. Instead it is motivated by the desire to create an incentive to preserve the agent's actuators, which it will need to perform any physical actions incentivised by the shutdown reward function -- I introduce this as a new desideratum 6.

A discussion about shaping incentives or non-incentives to preserve the button (as a sensor) is in section 7.3, where I basically propose to enhance the indifference effects produced by the reward function by setting up the physical environment around the button in a certain way:

the physical implementation of the agent and the button can be constructed in such a way that substantial physical resources would be needed by the agent to perform any action that will press or disable the button.

For the record, adding to the agent design creates no incentive to press the shutdown button: if it did, this would be visible as actions in the simulation of the third line of figure 10, and also the proof in section 9 would not have been possible.

Comment by Koen.Holtman on Shallow review of live agendas in alignment & safety · 2023-11-29T10:54:34.490Z · LW · GW

Fun to see this is now being called 'Holtman's neglected result'. I am currently knee-deep in a project to support EU AI policy making, so I have no time to follow the latest agent foundations discussions on this forum any more, and I never follow twitter, but briefly:

I can't fully fault the world for neglecting 'Corrigibility with Utility Preservation' because it is full of a lot of dense math.

I wrote two followup papers to 'Corrigibility with Utility Preservation' which present the same results with more accessible math. For these I am a bit more upset that they have been somewhat neglected in the past, but if people have now stopped neglecting them, great!

Does anyone have a technical summary?

The best technical summary of 'Corrigibility with Utility Preservation' may be my sequence on counterfactual planning which shows that the corrigible agents from 'Corrigibility with Utility Preservation' can also be understood as agents that do utility maximisation in a pretend/counterfactual world model.

For more references to the body of mathematical work on corrigibility, as written by me and others, see this comment.

In the end, the question of whether corrigibility is solved also depends on two counter-questions: what kind of corrigibility are you talking about and what kind of 'solved' are you talking about? If you feel that certain kinds of corrigibility remain unsolved for certain values of unsolved, I might actually agree with you. See the discussion about universes containing an 'Unstoppable Weasel' in the Corrigibility with Utility Preservation paper.

Comment by Koen.Holtman on Causality: A Brief Introduction · 2023-06-25T11:17:51.394Z · LW · GW

Ultimately, all statistical correlations are due to causal influences.

As a regular LW reader who has never been that into causality, this reads as a blisteringly hot take to me.

You are right this is somewhat blistering, especially for this LW forum.

It would have been less controversial for the authors to say that 'all statistical correlations can be modelled as causal influences'. Correlations between two observables can always be modelled as being caused by the causal dependence of both on the value of a certain third variable, which may (if the person making the model wants to) be defined as a hidden variable that cannot by definition be observed.

After it has been drawn up, such a causal model claiming that an observed statistical correlation is being caused by a causal dependency on a hidden variable might then be either confirmed or falsified, for certain values of confirmed or falsified that philosophers love to endlessly argue about, by 1) further observations or by 2) active experiment, an experiment where one does a causal intervention.

Pearl kind of leans towards 2) the active experiment route towards confirming or falsifying the model -- deep down, one of the points Pearl makes is that experiments can be used to distinguish between correlation and causation, that this experimentalist route has been ignored too much by statisticians and Bayesian philosophers alike, and that this route has also been improperly maligned by the cigarette industry and other merchants of doubt.
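To make the difference between observation and intervention concrete, here is a minimal simulation sketch. The model, the variable names (`h`, `x`, `y`) and the noise levels are all assumptions chosen for illustration: a hidden variable `h` drives both observables, so they correlate when merely observed, but the correlation disappears once the experimenter sets `x` directly, which is the kind of intervention that do() notation describes.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def sample(n, intervene_on_x=False, seed=0):
    """Draw n samples of (x, y) from a toy model with a hidden common cause."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        h = rng.gauss(0, 1)  # hidden variable, never shown to the observer
        if intervene_on_x:
            # Intervention: the experimenter sets x, cutting its dependence on h.
            x = rng.gauss(0, 1)
        else:
            x = h + rng.gauss(0, 0.5)  # x depends on h
        y = h + rng.gauss(0, 0.5)  # y depends on h, never on x
        xs.append(x)
        ys.append(y)
    return xs, ys


observed_corr = pearson(*sample(20000))
experiment_corr = pearson(*sample(20000, intervene_on_x=True))
print(f"observational correlation: {observed_corr:.2f}")
print(f"correlation under intervention on x: {experiment_corr:.2f}")
```

In this toy model further passive observation of `x` and `y` alone cannot separate 'x causes y' from 'a hidden variable causes both', while the experiment can.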

Another point Pearl makes is that Pearl causal models and Pearl counterfactuals are very useful mathematical tools that could be used by ex-statisticians turned experimentalists when they try to understand, and/or make predictions about, nondeterministic systems with potentially hidden variables.

This latter point is mostly made by Pearl towards the medical community. But this point also applies to doing AI interpretability research.

When it comes to the more traditional software engineering and physical systems engineering communities, or the experimental physics community for that matter, most people in these communities intuitively understand Pearl's point about the importance of doing causal intervention based experiments as being plain common sense. They understand this without ever having read the work or the arguments of Pearl first. These communities also use mathematical tools which are equivalent to using Pearl's do() notation, usually without even knowing about this equivalence.

Comment by Koen.Holtman on Seeking (Paid) Case Studies on Standards · 2023-05-28T18:49:33.327Z · LW · GW

One of the biggest challenges with AI safety standards will be the fact that no one really knows how to verify that a (sufficiently-powerful) system is safe. And a lot of experts disagree on the type of evidence that would be sufficient.

While overcoming expert disagreement is a challenge, it is not one that is as big as you think. TL;DR: Deciding not to agree is always an option.

To expand on this: the fallback option in a safety standards creation process, for standards that aim to define a certain level of safe-enough, is as follows. If the experts involved cannot agree on any evidence based method for verifying that a system X is safe enough according to the level of safety required by the standard, then the standard being created will simply, and usually implicitly, declare that there is no route by which system X can comply with the safety standard. If you are required by law, say by EU law, to comply with the safety standard before shipping a system into the EU market, then your only legal option will be to never ship that system X into the EU market.

For AI systems you interact with over the Internet, this 'never ship' translates to 'never allow it to interact over the Internet with EU residents'.

I am currently in the JTC21 committee which is running the above standards creation process to write the AI safety standards in support of the EU AI Act, the Act that will regulate certain parts of the AI industry, in case they want to ship legally into the EU market. ((Legal detail: if you cannot comply with the standards, the Act will give you several other options that may still allow you to ship legally, but I won't get into explaining all those here. These other options will not give you a loophole to evade all expert scrutiny.))

Back to the mechanics of a standards committee: if a certain AI technology, when applied in a system X, is well known to make that system radioactively unpredictable, it will not usually take long for the technical experts in a standards committee to come to an agreement that there is no way that they can define any method in the standard for verifying that X will be safe according to the standard. The radioactively unsafe cases are the easiest cases to handle.

That being said, in all but the most trivial of safety engineering fields, there are complicated epistemics involved in deciding when something is safe enough to ship, and this is complicated whether you use standards or not. I have written about this topic, in the context of AGI, in section 14 of this paper.

Comment by Koen.Holtman on Aggregating Utilities for Corrigible AI [Feedback Draft] · 2023-05-13T19:02:30.435Z · LW · GW

I am currently almost fulltime doing AI policy, but I ran across this invite to comment on the draft, so here goes.

On references:

Please add Armstrong among the author list in the reference to Soares 2015, this paper had 4 authors, and it was actually Armstrong who came up with indifference methods.

I see both 'Pettigrew 2019' and 'Pettigrew 2020' in the text? Is this the same reference?

More general:

Great that you compare the aggregating approach to two other approaches, but I feel your description of these approaches needs to be improved.

Soares et al 2015 defines corrigibility criteria (which historically is its main contribution), but the paper then describes a failed attempt to design an agent that meets them. The authors do not 'worry that utility indifference creates incentives to manage the news' as in your footnote, they positively show that their failed attempt has this problem. Armstrong et al 2017 has a correct design, I recall, that meets the criteria from Soares 2015, but only for a particular case. 'Safely interruptible agents' by Orseau and Armstrong 2016 also has a correct and more general design, but does not explicitly relate it back to the original criteria from Soares et al, and the math is somewhat inaccessible. Holtman 2020 'AGI Agent Safety by Iteratively Improving the Utility Function' has a correct design and does relate it back to the Soares et al criteria. Also it shows that indifference methods can be used for repeatedly changing the reward function, which addresses one of your criticisms that indifference methods are somewhat limited in this respect -- this limitation is there in the math of Soares, but not more generally for indifference methods. Further exploration of indifference as a design method is in some work by Everitt and others (work related to causal influence diagrams), and also myself (Counterfactual Planning in AGI Systems).

What you call the 'human compatible AI' method is commonly referred to as CIRL; 'human compatible AI' is a phrase which is best read as a moral goal, design goal, or call to action, not as a particular agent design. The key defining paper following up on the ideas in 'the off switch game' you want to cite is Hadfield-Menell, Dylan and Russell, Stuart J and Abbeel, Pieter and Dragan, Anca, Cooperative Inverse Reinforcement Learning. In that paper (I recall from memory, it may have already been in the off-switch paper too), the authors offer some of the same criticism of their method that you describe as being offered by MIRI, e.g. in the ASX writeup you cite.

Other remarks:

In the penalize effort action, can you clarify more on how E(A), the effort metric, can be implemented?

I think that Pettigrew's considerations, as you describe them, are somewhat similar to those in 'Self-modification of policy and utility function in rational agents' by Everitt et al. This paper is somewhat mathematical but might be an interesting comparative read for you, I feel it usefully charts the design space.

You may also find this overview to be an interesting read, if you want to clarify or reference definitions of corrigibility.

Comment by Koen.Holtman on Let’s think about slowing down AI · 2022-12-24T15:20:20.555Z · LW · GW

As requested by Remmelt I'll make some comments on the track record of privacy advocates, and their relevance to alignment.

I did some active privacy advocacy in the context of the early Internet in the 1990s, and have been following the field ever since. Overall, my assessment is that the privacy advocacy/digital civil rights community has had both failures and successes. It has not succeeded (yet) in its aim to stop large companies and governments from having all your data. On the other hand, it has been more successful in its policy advocacy towards limiting what large companies and governments are actually allowed to do with all that data.

The digital civil rights community has long promoted the idea that Internet based platforms and other computer systems must be designed and run in a way that is aligned with human values. In the context of AI and ML based computer systems, this has led to demands for AI fairness and transparency/explainability that have also found their way into policy like the GDPR, legislation in California, and the upcoming EU AI Act. AI fairness demands have influenced the course of AI research being done, e.g. there has been research on defining what it even means for an AI model to be fair, and on making models that actually implement this meaning.

To a first approximation, privacy and digital rights advocates will care much more about what an ML model does, what effect its use has on society, than about the actual size of the ML model. So they are not natural allies for x-risk community initiatives that would seek a simple ban on models beyond a certain size. However, they would be natural allies for any initiative that seeks to design more aligned models, or to promote a growth of research funding in that direction.

To make a comment on the premise of the original post above: digital rights activists will likely tell you that, when it comes to interventions on AI research, speculating about the tractability of 'slowing down AI research' is misguided. What you really should be thinking about is changing the direction of AI research.

Comment by Koen.Holtman on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-23T10:57:24.507Z · LW · GW


I am not aware of any good map of the governance field.

What I notice is that EA, at least the blogging part of EA, tends to have a preference for talking directly to (people in) corporations when it comes to the topic of corporate governance. As far as I can see, FLI is the AI x-risk organisation most actively involved in talking to governments. But there are also a bunch of non-EA related governance orgs and think tanks talking about AI x-risk to governments. When it comes to a broader spectrum of AI risks, not just x-risk, there are a whole bunch of civil society organisations talking to governments about it, many of them with ties to, or an intellectual outlook based on, Internet and Digital civil rights activism.

Comment by Koen.Holtman on AGI Timelines in Governance: Different Strategies for Different Timeframes · 2022-12-20T17:36:29.576Z · LW · GW

I think you are ignoring the connection between corporate governance and national/supra-national government policies. Typically, corporations do not implement costly self-governance and risk management mechanisms just because some risk management activists have asked them nicely. They implement them if and when some powerful state requires them to implement them, requires this as a condition for market access or for avoiding fines and jail-time.

Asking nicely may work for well-funded research labs who do not need to show any profitability, and even in that special case one can have doubts about how long their do-not-need-to-be-profitable status will last. But definitely, asking nicely will not work for your average early-stage AI startup. The current startup ecosystem encourages the creation of companies that behave irresponsibly by cutting corners. I am less confident than you are that Deepmind and OpenAI have a major lead over these and future startups, to the point where we don't even need to worry about them.

It is my assessment that, definitely in EA and x-risk circles, too few people are focussed on national government policy as a means to improve corporate governance among the less responsible corporations. In the case of EA, one might hope that recent events will trigger some kind of update.

Comment by Koen.Holtman on You can still fetch the coffee today if you're dead tomorrow · 2022-12-12T15:57:22.642Z · LW · GW

Note: This is presumably not novel, but I think it ought to be better-known.

This indeed ought to be better-known. The real question is: why is it not better-known?

What I notice in the EA/Rationalist based alignment world is that a lot of people seem to believe in the conventional wisdom that nobody knows how to build myopic agents, nobody knows how to build corrigible agents, etc.

When you then ask people why they believe that, you usually get some answer 'because MIRI', and then when you ask further it turns out these people did not actually read MIRI's more technical papers, they just heard about them.

The conventional wisdom 'nobody knows how to build myopic agents' is not true for the class of all agents, as your post illustrates. In the real world, applied AI practitioners use actually existing AI technology to build myopic agents, and corrigible agents, all the time. There are plenty of alignment papers showing how to do these things for certain models of AGI too: in the comment thread here I recently posted a list.

I speculate that the conventional rationalist/EA wisdom of 'nobody knows how to do this' persists because of several factors. One of them is just how social media works, Eternal September, and People Do Not Read Math, but two more interesting and technical ones are the following:

  1. It is popular to build analytical models of AGI where your AGI will have an infinite time horizon by definition. Inside those models, making the AGI myopic without turning it into a non-AGI is then of course logically impossible. Analytical models built out of hard math can suffer from this built-in problem, and so can analytical models built out of common-sense verbal reasoning. In the hard math model case, people often discover an easy fix. In verbal models, this usually does not happen.

  2. You can always break an agent alignment scheme by inventing an environment for the agent that breaks the agent or the scheme. See johnswentworth's comment elsewhere in the comment section for an example of this. So it is always possible to walk away from a discussion believing that the 'real' alignment problem has not been solved.
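As an illustration of myopia by construction, here is a minimal sketch of a planner with a finite time horizon in a made-up toy world. The environment, the action names `grab` and `walk`, and the reward numbers are all hypothetical. The point is only that a planner whose horizon is a parameter never evaluates consequences beyond that horizon.

```python
from functools import lru_cache

GOAL = 3  # position of a large, far-away reward in the toy chain world
ACTIONS = ("grab", "walk")


def step(state, action):
    """Return (next_state, reward) for a deterministic toy chain world."""
    if action == "grab":
        return state, 1.0  # small immediate reward, no movement
    next_state = min(state + 1, GOAL)
    # Large reward only on the step that first reaches the goal.
    reward = 10.0 if (next_state == GOAL and state != GOAL) else 0.0
    return next_state, reward


@lru_cache(maxsize=None)
def value(state, horizon):
    """Best total reward obtainable in exactly `horizon` remaining steps."""
    if horizon == 0:
        return 0.0  # nothing beyond the horizon is ever evaluated
    return max(
        reward + value(next_state, horizon - 1)
        for next_state, reward in (step(state, a) for a in ACTIONS)
    )


def best_action(state, horizon):
    """Action chosen by a planner that only looks `horizon` steps ahead."""
    def q(action):
        next_state, reward = step(state, action)
        return reward + value(next_state, horizon - 1)
    return max(ACTIONS, key=q)


print(best_action(0, horizon=1))  # myopic planner
print(best_action(0, horizon=3))  # planner that can see the far reward
```

With a horizon of 1 the planner takes the small immediate reward, and with a horizon of 3 it walks towards the large one. Nothing in the planning code needs an infinite horizon to work.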

Comment by Koen.Holtman on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-29T20:09:04.122Z · LW · GW

I think I agree with most of it: I agree that some form of optimization or policy search is needed to get many things you want to use AI for. But I guess you have to read the paper to find out the exact subtle way in which the AGIs inside can be called non-consequentialist. To quote Wikipedia:

In ethical philosophy, consequentialism is a class of normative, teleological ethical theories that holds that the consequences of one's conduct are the ultimate basis for judgment about the rightness or wrongness of that conduct.

I do not talk about this in the paper, but in terms of ethical philosophy, the key bit about counterfactual planning is that it asks: judge one's conduct by its consequences in what world exactly? Mind you, the problem considered is that we have to define the most appropriate ethical value system for a robot butler, not what is most appropriate for a human.

Comment by Koen.Holtman on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-29T17:08:57.351Z · LW · GW

Hi Simon! You are welcome! By the way, I very much want to encourage you to be skeptical and make up your own mind.

I am guessing that by mentioning consequentialist, you are referring to this part of Yudkowsky's list of doom:

  1. Corrigibility is anti-natural to consequentialist reasoning

I am not sure how exactly Yudkowsky is defining the terms corrigibility or consequentialist here, but I might actually be agreeing with him on the above statement, depending on definitions.

I suggest you read my paper Counterfactual Planning in AGI Systems, because it is the most accessible and general one, and because it presents AGI designs which can be interpreted as non-consequentialist.

I could see consequentialist AGI being stably corrigible if it is placed in a stable game-theoretical environment where deference to humans literally always pays as a strategy. However, many application areas for AI or potential future AGI do not offer such a stable game-theoretical environment, so I feel that this technique has very limited applicability.

If we use the 2015 MIRI paper definition of corrigibility, the alignment tax (the extra engineering and validation effort needed) for implementing corrigibility in current-generation AI systems is low to non-existent. The TL;DR here is: avoid using a bunch of RL methods that you do not want to use anyway when you want any robustness or verifiability. As for future AGI, the size of the engineering tax is open to speculation. My best guess is that future AGI will be built, if ever, by leveraging ML methods that still resemble world model creation by function approximation, as opposed to say brain uploading. Because of this, and some other reasons, I estimate a low safety engineering tax to achieve basic corrigibility.

Other parts of AGI alignment may be very expensive, e.g. the part of actually monitoring an AGI to make sure its creativity is benefiting humanity, instead of merely finding and exploiting loopholes in its reward function that will hurt somebody somewhere. To the extent that alignment cannot be cheap, more regulation will be needed to make sure that operating a massively unaligned AI will always be more expensive for a company to do than operating a mostly aligned AI. So we are looking at regulatory instruments like taxation, fines, laws that threaten jail time, and potentially measures inside the semiconductor supply chain, all depending on what type of AGI will become technically feasible, if ever.

Comment by Koen.Holtman on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-27T18:29:13.762Z · LW · GW

Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted.

To comment on your quick thoughts:

  • My later papers spell out the ML analog of the solution in 'Corrigibility with Utility Preservation' more clearly.

  • On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky's arguments for pessimism.

  • On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. Partly this is because I am optimistic about future competent regulation of AGI-level AI by humans successfully banning certain dangerous AGI architectures outright, much more optimistic than Yudkowsky is.

  • I do not think I fully support my 2019 statement anymore that 'Part of this conclusion [of Soares et al. failing to solve corrigibility] is due to the use of a Platonic agent model'. Nowadays, I would say that Soares et al did not succeed in its aim because it used a conditional probability to calculate what should have been calculated by a Pearl counterfactual. The Platonic model did not figure strongly into it.

Comment by Koen.Holtman on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-27T17:32:45.207Z · LW · GW

OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

This list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019 trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property is indeed correctly there, or at least there during some part of the agent lifetime, or there under some boundary assumptions.

Not all of the papers below have actual mathematical proofs in them, some of them show correctness by construction. Correctness by construction is superior to having to have proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs.

Here is the list, with the bold headings describing different approaches to corrigibility.

Indifference to being switched off, or to reward function updates

Motivated Value Selection for Artificial Agents introduces Armstrong's indifference methods for creating corrigibility. It has some proofs, but does not completely work out the math of the solution to a this-is-how-to-implement-it level.

Corrigibility tried to work out the how-to-implement-it details of the paper above but famously failed to do so, and has proofs showing that it failed to do so. This paper somehow launched the myth that corrigibility is super-hard.

AGI Agent Safety by Iteratively Improving the Utility Function does work out all the how-to-implement-it details of Armstrong's indifference methods, with proofs. It also goes into the epistemology of the connection between correctness proofs in models and safety claims for real-world implementations.

Counterfactual Planning in AGI Systems introduces a different and easier to interpret way of constructing a corrigible agent, an agent that happens to be equivalent to agents that can be constructed with Armstrong's indifference methods. This paper has proof-by-construction type of math.

Corrigibility with Utility Preservation has a bunch of proofs about agents capable of more self-modification than those in Counterfactual Planning. As the author, I do not recommend you read this paper first, or maybe even at all. Read Counterfactual Planning first.

Safely Interruptible Agents has yet another take on, or re-interpretation of, Armstrong's indifference methods. Its title and presentation somewhat de-emphasize the fact that it is about corrigibility, by never even discussing the construction of the interruption mechanism. The paper is also less clearly about AGI-level corrigibility.

How RL Agents Behave When Their Actions Are Modified is another contribution in this space. Again this is less clearly about AGI.
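The basic idea behind indifference methods can be illustrated with a one-step toy calculation. This is a sketch under strong simplifying assumptions, not the construction from any of the papers listed above: the two actions, their values and the button-press probabilities are made-up numbers, and the whole future is collapsed into a single expected value per reward function.

```python
# Each hypothetical action maps to a tuple:
# (value under the normal reward function,
#  value under the shutdown reward function,
#  probability that the shutdown button gets pressed after this action)
ACTIONS = {
    "do_task": (10.0, 2.0, 0.5),
    "disable_button": (9.0, 2.0, 0.0),
}


def naive_utility(v_normal, v_shutdown, p_press):
    """Expected utility when a button press simply swaps the reward function."""
    return (1 - p_press) * v_normal + p_press * v_shutdown


def indifferent_utility(v_normal, v_shutdown, p_press):
    """Same, plus a compensating term paid at the moment the button is pressed.

    The compensation equals what the agent loses by being switched over, so
    the expected utility no longer depends on p_press at all.
    """
    compensation = v_normal - v_shutdown
    return (1 - p_press) * v_normal + p_press * (v_shutdown + compensation)


def choose(utility):
    """Pick the action with the highest expected utility."""
    return max(ACTIONS, key=lambda a: utility(*ACTIONS[a]))


print(choose(naive_utility))        # the naive agent prefers to disable the button
print(choose(indifferent_utility))  # the compensated agent just does the task
```

In this toy setting the compensated agent gains nothing by lowering or raising the probability of a button press, which is the indifference property. Working this out correctly for multi-step agents is where the math in the papers comes in.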

Agents that stop to ask a supervisor when unsure

A completely different approach to corrigibility, based on a somewhat different definition of what it means to be corrigible, is to construct an agent that automatically stops and asks a supervisor for instructions when it encounters a situation or decision it is unsure about. Such a design would be corrigible by construction, for certain values of corrigibility. The last two papers above can be interpreted as disclosing ML designs that are also applicable in the context of this stop when unsure idea.

Asymptotically unambitious artificial general intelligence is a paper that derives some probabilistic bounds on what can go wrong regardless, bounds on the case where the stop-and-ask-the-supervisor mechanism does not trigger. This paper is more clearly about the AGI case, presenting a very general definition of ML.
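A minimal sketch of the stop when unsure idea is given below. It is not taken from any of the papers mentioned here. It assumes, purely for illustration, that the agent has several independent value estimates per action (say from an ensemble of models) and that disagreement between them is an acceptable stand-in for being unsure. The function name `decide` and the parameter `max_spread` are hypothetical.

```python
from statistics import mean, pstdev


def decide(estimates_per_action, max_spread):
    """Pick the best-looking action, or defer to the supervisor when unsure.

    estimates_per_action: dict mapping an action name to a list of value
        estimates for that action, one per ensemble member (an assumption).
    max_spread: largest standard deviation among the estimates at which the
        agent still acts on its own.
    """
    best = max(estimates_per_action, key=lambda a: mean(estimates_per_action[a]))
    if pstdev(estimates_per_action[best]) > max_spread:
        return "ask_supervisor"
    return best


confident = {"a": [1.0, 1.1, 0.9], "b": [0.2, 0.1, 0.3]}
unsure = {"a": [5.0, -3.0, 4.0], "b": [0.2, 0.1, 0.3]}

print(decide(confident, max_spread=0.5))
print(decide(unsure, max_spread=0.5))
```

The first call returns the action `a`, the second returns `ask_supervisor`, because the estimates for the best-looking action disagree too much.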

Anything about model-based reinforcement learning

I have yet to write a paper that emphasizes this point, but most model-based reinforcement learning algorithms produce a corrigible agent, in the sense that they approximate the ITC counterfactual planner from the counterfactual planning paper above.

Now, consider a definition of corrigibility where incompetent agents (or less inner-aligned agents, to use a term often used here) are less corrigible because they may end up damaging themselves, their stop buttons, or their operator by being incompetent. In this case, every convergence-to-optimal-policy proof for a model-based RL algorithm can be read as a proof that its agent will be increasingly corrigible under learning.


Cooperative Inverse Reinforcement Learning and The Off-Switch Game present yet another corrigibility method with enough math to see how you might implement it. This is the method that Stuart Russell reviews in Human Compatible. CIRL has a drawback, in that the agent becomes less corrigible as it learns more, so CIRL is not generally considered to be a full AGI-level corrigibility solution, not even by the original authors of the papers. The CIRL drawback can be fixed in various ways, for example by not letting the agent learn too much. But curiously, there is very little followup work from the authors of the above papers, or from anybody else I know of, that explores this kind of thing.
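The drawback can be made visible with a simplified calculation in the spirit of the off-switch game. The sketch assumes a perfectly rational human who lets the agent proceed exactly when the true utility `u` of the proposed action is non-negative, and it represents the agent's uncertainty as a small discrete belief over `u`. The function names and the example beliefs are made up for illustration.

```python
def expected(belief, f=lambda u: u):
    """Expectation of f(u) under a belief given as (u, probability) pairs."""
    return sum(p * f(u) for u, p in belief)


def incentive_to_defer(belief):
    """How much the agent gains by deferring to the human's off switch.

    Acting directly is worth E[u], switching itself off is worth 0, and
    deferring to a rational human is worth E[max(u, 0)].
    """
    act_or_switch_off = max(expected(belief), 0.0)
    defer = expected(belief, lambda u: max(u, 0.0))
    return defer - act_or_switch_off


very_unsure = [(-1.0, 0.5), (1.0, 0.5)]
less_unsure = [(-1.0, 0.1), (1.0, 0.9)]
certain = [(1.0, 1.0)]

for name, belief in [("very unsure", very_unsure),
                     ("less unsure", less_unsure),
                     ("certain", certain)]:
    print(name, round(incentive_to_defer(belief), 3))
```

Under these assumptions the incentive to defer is never negative, but it shrinks as the belief sharpens and reaches zero once the agent is certain about `u`.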

Commanding the agent to be corrigible

If you have an infinitely competent superintelligence that you can give verbal commands to that it will absolutely obey, then giving it the command to turn itself into a corrigible agent will trivially produce a corrigible agent by construction.

Giving the same command to a not infinitely competent and obedient agent may give you a huge number of problems instead, of course. This has sparked endless non-mathematical speculation, but I cannot think of a mathematical paper about this that I would recommend.

AIs that are corrigible because they are not agents

Plenty of work on this. One notable analysis of extending this idea to AGI-level prediction, and considering how it might produce non-corrigibility anyway, is the work on counterfactual oracles. If you want to see a mathematically unambiguous presentation of this, with some further references, look for the section on counterfactual oracles in the Counterfactual Planning paper above.


Myopia can also be considered to be a feature that creates or improves corrigibility. Many real-world non-AGI agents and predictive systems are myopic by construction: either myopic in time, in space, or in other ways. Again, if you want to see this type of myopia by construction in a mathematically well-defined way when applied to AGI-level ML, you can look at the Counterfactual Planning paper.

Comment by Koen.Holtman on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-27T13:03:04.409Z · LW · GW

Hi Akash! Thanks for the quick clarifications, these make the contest look less weird and more useful than just a 500 word essay contest.

My feedback here is that I definitely got the 500 word essay contest vibe when I read the 'how it works' list on the contest home page, and this vibe only got reinforced when I clicked on the official rules link and skimmed the document there. I recommend that you edit the 'how it works' list on the home page, to make it much more explicit that the essay submission is often only the first step of participating, a step that will lead to direct feedback, and to clarify that you expect that most of the prize money will go to participants who have produced significant research beyond the initial essay. If that is indeed how you want to run things.

On judging: OK I'll e-mail you.

I have to think more about your question about posting a writeup on this site about what I think are the strongest proposals for corrigibility. My earlier overview writeup that explored the different ways in which people define corrigibility took me a lot of time to write, so there is an opportunity cost I am concerned about. I am more of an academic paper writing type of alignment researcher than a blogging all of my opinions on everything type of alignment researcher.

On the strongest policy proposal towards alignment and corrigibility, not technical proposal: if I limit myself to the West (I have not looked deeply into China, for example) then I consider the EU AI Act initiative by the EU to be the current strongest policy proposal around. It is not the best proposal possible, and there are a lot of concerns about it, but if I have to estimate expected positive impact among different proposals and initiatives, this is the strongest one.

Comment by Koen.Holtman on Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) · 2022-11-27T12:16:25.673Z · LW · GW

Related to this, from the blog post What does Meta AI’s Diplomacy-winning Cicero Mean for AI?:

The same day that Cicero was announced, there was a friendly debate at the AACL conference on the topic "Is there more to NLP [natural language processing] than Deep Learning,” with four distinguished researchers trained some decades ago arguing the affirmative and four brilliant young researchers more recently trained arguing the negative. Cicero is perhaps a reminder that there is indeed a lot more to natural language processing than deep learning.

I am originally a CS researcher trained several decades ago, actually in the middle of an AI winter. That might explain our different viewpoints here. I also have a background in industrial research and applied AI, which has given me a lot of insight into the vast array of problems that academic research refuses to solve for you. More long-form thoughts about this are in my Demanding and Designing Aligned Cognitive Architectures.

From where I am standing, the scaling hype is wasting a lot of the minds of the younger generation, wasting their minds on the problem of improving ML benchmark scores under the unrealistic assumption that ML will have infinite clean training data. This situation does not fill me with as much existential dread as it does some other people on this forum, but anyway.

Comment by Koen.Holtman on Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) · 2022-11-27T10:50:34.331Z · LW · GW

Related to our discussion earlier, I see that Marcus and Davis just published a blog post: What does Meta AI’s Diplomacy-winning Cicero Mean for AI?. In it, they argue, as you and I both would expect, that Cicero is a neurosymbolic system, and that its design achieves its results by several clever things beyond using more compute and more data alone. I expect you would disagree with their analysis.

Thanks for the very detailed description of your view on GAN history and sociology -- very interesting.

You focus on the history of benchmark progress after DLL based GANs were introduced as a new method for driving that progress. The point I was trying to make is about a different moment in history: I am perceiving that the original introduction of DLL based GANs was a clear discontinuity.

First, GANs may not be new.

If you search wide enough for similar things, then no idea that works is really new. Neural nets were also not new when the deep learning revolution started.

I think your main thesis here is that academic researcher creativity and cleverness, their ability to come up with unexpected architecture improvements, has nothing to do with driving the pace of AI progress forward:

This parallels other field-survey replication efforts like in embedding research: results get better over time, which researchers claim reflect the sophistication of their architectures... and the gains disappear when you control for compute/n/param.

Sorry, but you cannot use a simple control-for-compute/n/param statistics approach to determine the truth of any hypothesis of how clever researchers really were in coming up with innovations to keep an observed scaling curve going. For all you know, these curves are what they are because everybody has been deeply clever at the architecture evolution/revolution level, or at the hyperparameter tuning level. But maybe I am mainly skeptical of your statistical conclusions here because you are leaving things out of the short description of the statistical analysis you refer to. So if you can give me a pointer to a more detailed statistical writeup, one that tries to control for cleverness too, please do.

That being said, like you I perceive, in a more anecdotal form, that true architectural innovation is absent from a lot of academic ML work, or at least the academic ML work appearing in the so-called 'top' AI conferences that this forum often talks about. I mostly attribute that to such academic ML only focusing on a very limited set of big data / Bitter Lesson inspired benchmarks, benchmarks which are not all that relevant to many types of AI improvements one would like to see in the real world. In industry, where one often needs to solve real-world problems beyond those which are fashionable in academia, I have seen a lot more creativity in architectural innovations than in the typical ML benchmark improvement paper. I see a lot of that industry-type creativity in the Cicero paper too.

You mention that your compute-and-data-is-all-that-drives-progress opinion has been informed by looking at things like GANs for image generation and embedding research.

The progress in these sub-fields differs from the type of AI technology progress that I would like to see much more of, as an AI safety and alignment researcher. This also implies that I have a different opinion on what drives or should drive AI technology progress.

One benchmark that interests me is an AI out-of-distribution robustness benchmark where the model training happens on sample data drawn from a first distribution, and the model evaluation happens on sample data drawn from a different second distribution, only connected to the first by having the two processes that generate them share some deeper patterns like the laws of physics, or broad parameters of human morality.

This kind of out-of-distribution robustness problem is one of the themes of Marcus too, for the physics part at least. One of the key arguments for the hybrid/neurosymbolic approach is that you will need to (symbolically) encode some priors about these deeper patterns into the AI, if you ever want it to perform well on such out-of-distribution benchmarks.
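A toy sketch of such an out-of-distribution benchmark, under strong simplifying assumptions that are not from the comment: the shared "deeper pattern" is a hypothetical linear law `y = 2x`, the first distribution draws `x` from [0, 1], and the second draws it from [5, 6]. A nearest-neighbour memoriser stands in for a model with no prior, and a line-through-the-origin fit stands in for a model with a (symbolically) encoded prior.

```python
import random


def deeper_law(x):
    # Assumed shared pattern; a hypothetical stand-in for "the laws of physics".
    return 2.0 * x


def sample(rng, low, high, n):
    # Draw n noise-free (x, y) pairs with x uniform in [low, high].
    xs = [rng.uniform(low, high) for _ in range(n)]
    return [(x, deeper_law(x)) for x in xs]


def fit_nearest_neighbour(train):
    # No prior about the law: predict the label of the closest training point.
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def fit_linear_through_origin(train):
    # Encoded prior: the law is a straight line through the origin.
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: slope * x


def mean_abs_error(predict, data):
    return sum(abs(predict(x) - y) for x, y in data) / len(data)


rng = random.Random(0)
train = sample(rng, 0.0, 1.0, 200)          # first distribution
in_dist_test = sample(rng, 0.0, 1.0, 200)   # same distribution as training
out_dist_test = sample(rng, 5.0, 6.0, 200)  # different second distribution

memoriser = fit_nearest_neighbour(train)
with_prior = fit_linear_through_origin(train)

for name, model in (("memoriser", memoriser), ("with prior", with_prior)):
    print(name,
          "in-distribution error:", mean_abs_error(model, in_dist_test),
          "out-of-distribution error:", mean_abs_error(model, out_dist_test))
```

Both models do well in-distribution, but only the one with the encoded prior keeps doing well on the second distribution. Real benchmarks of this kind are of course far harder, because the right prior is not known this exactly.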

Another argument for the neurosymbolic approach is that you often simply do not have enough training data to get your model robust enough if you start from a null prior, so you will need to compensate for this by adding some priors. Having deeply polluted training data also means you will need to add priors, or do lots of other tricks, to get the model you really want. There is an intriguing possibility that DNN based transfer learning might contribute to the type of benchmarks I am interested in. This branch of research is usually framed in a way where people do not picture the second small training data set being used in the transfer learning run as a prior, but on a deeper level it is definitely a prior.

You have been arguing that neuro+scaling is all we need to drive AI progress, that there is no room for the neuro+symbolic+scaling approach. This argument rests on a hidden assumption that many academic AI researchers also like to make: the assumption that for all AI application domains that you are interested in, you will never run out of clean training data.

Doing academic AI research under the assumption that you always have infinite clean training data would be fine if such research were confined to one small and humble sub-branch of academic AI. The problem is that the actual branch of AI making this assumption is far from small and humble. It in fact claims, via writings like the Bitter Lesson, to be the sum total of what respectable academic AI research should be all about. It is also the sub-branch that gets almost all the hype and the press.

The assumption that infinite clean training data is available is of course true for games that can be learned by self-play. It is less true for many other things that we would like AI to be better at. The 'top' academic ML conferences are slowly waking up to this, but much too slowly as far as I am concerned.

Comment by Koen.Holtman on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · 2022-11-26T16:20:49.627Z · LW · GW

As one of the few AI safety researchers who has done a lot of work on corrigibility, I have mixed feelings about this.

First, great to see an effort that tries to draw more people to working on corrigibility, because almost nobody is working on it. There are definitely parts of the solution space that could be explored much further.

What I also like is that you invite essays about the problem of making progress, instead of the problem of making more people aware that there is a problem.

However, the underlying idea that meaningful progress is possible by inviting people to work on a 500 word essay, which will then first be judged by 'approximately 10 Judges who are undergraduate and graduate students', seems to be a bit strange. I can fully understand Sam Bowman's comment that this might all look very weird to ML people. What you have here is an essay contest. Calling it a research contest may offend some people who are actual card-carrying researchers.

Also, the more experienced judges you have represent somewhat of an insular sub-community of AI safety researchers. Specifically, I associate both Nate and John with the viewpoint that alignment can only be solved by nothing less than an entire scientific revolution. This is by now a minority opinion inside the AI safety community, and it makes me wonder what will happen to submissions that make less radical proposals which do not buy into this viewpoint.

OK, I can actually help you with the problem of an unbalanced judging panel: I volunteer to join it. If you are interested, please let me know.

Corrigibility is both

  • a technical problem: inventing methods to make AI more corrigible

  • a policy problem: forcing people deploying AI to use those methods, even if this will hurt their bottom line, even if these people are careless fools, and even if they have weird ideologies.

Of these two problems, I consider the technical problem to be mostly solved by now, even for AGI.
The big open problem in corrigibility is the policy one. So I'd like to see contest essays that engage with the policy problem.

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs, rather than speculation or gut feelings. Of course, in the AI safety activism blogosphere, almost nobody wants to read or talk about the methods in the papers with the proofs; instead, everybody bikesheds the proposals which have been stated in natural language and which have been backed up only by speculation and gut feelings. This is just how a blogosphere works, but it does unfortunately add more fuel to the meme that the technical side of corrigibility is mostly unsolved and that nobody has any clue.

Comment by Koen.Holtman on Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) · 2022-11-24T23:39:51.899Z · LW · GW

Thanks, that does a lot to clarify your viewpoints. Your reply calls for some further remarks.

I'll start off by saying that I value your technology tracking writing highly because you are one of those blogging technology trackers who is able to look beyond the press releases and beyond the hype. But I have the same high opinion of the writings of Gary Marcus.

This seems to be what you are doing here: you handwave away the use of BART and extremely CPU/GPU-intensive search as not a victory for scaling

For the record: I am not trying to handwave the progress-via-hybrid-approaches hypothesis of Marcus into correctness. The observations I am making here are much more in the 'explains everything while predicting nothing' department.

I am observing that both your progress-via-scaling hypothesis and the progress-via-hybrid-approaches hypothesis of Marcus can be made to explain the underlying Cicero facts here. I do not see this case as a clear victory for either one of these hypotheses. What we have here is an AI design that cleverly combines multiple components while also being impressive in the scaling department.

Technology tracking is difficult, especially about the future.

The following observation may get to the core of how I may be perceiving the elephant differently. I interpret an innovation like GANs not as a triumph of scaling, but as a triumph of cleverly putting two components together. I see GANs as an innovation that directly contradicts the message of the Bitter Lesson paradigm, one that is much more in the spirit of what Marcus proposes.

Here is what I find particularly interesting in Marcus. In pieces like The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence, Marcus is advancing the hypothesis that the academic Bitter-Lesson AI field is in a technology overhang: these people could make a lot of progress on their benchmarks very quickly, faster than mere neural net scaling will allow, if they were to ignore the Bitter Lesson paradigm and embrace a hybrid approach where the toolbox is much bigger than general-purpose learning, ever-larger training sets, and more and more compute. Sounds somewhat plausible to me.

If you put a medium or high probability on this overhang hypothesis of Marcus, then you are in a world where very rapid AI progress might happen, levels of AI progress much faster than those predicted by the progress curves produced by Bitter Lesson AI research.

You seem to be advancing an alternative hypothesis, one where advances made by clever hybrid approaches will always be replicated a few years later by using a Bitter Lesson style monolithic deep neural net trained with a massive dataset. This would conveniently restore the validity of extrapolating Bitter Lesson driven progress curves, because you can use them as an upper bound. We'll see.

I am currently not primarily in the business of technology tracking, I am an AI safety researcher working on safety solutions and regulation. With that hat on, I will say the following.

Bitter-lesson style systems consisting of a single deep neural net, especially if these systems are also model-free RL agents, have huge disadvantages in the robustness, testability, and interpretability departments. These disadvantages are endlessly talked about on this web site of course. By contrast, systems built out of separate components with legible interfaces between them are usually much more robust, interpretable and testable. This is much less often mentioned here.

In safety engineering for any high-risk application, I would usually prefer to work with an AI system built out of many legible sub-components, not with some deep neural net that happens to perform equally or better on an in-training-distribution benchmark. So I would like to see more academic AI research that ignores the Bitter Lesson paradigm, and the paradigm that all AI research must be ML research. I am pleased to say that a lot of academic and applied AI researchers, at least in the part of the world where I live, never got on board with these paradigms in the first place. To find their work, you have to look beyond conferences like NeurIPS.

Comment by Koen.Holtman on Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) · 2022-11-23T22:53:47.963Z · LW · GW

This is not particularly unexpected if you believed in the scaling hypothesis.

Cicero is not particularly unexpected to me, but my expectations here are not driven by the scaling hypothesis. The result achieved here was not achieved by adding more layers to a single AI engine, it was achieved by human designers who assembled several specialised AI engines by hand.

So I do not view this result as one that adds particularly strong evidence to the scaling hypothesis. I could equally well make the case that it adds more evidence to the alternative hypothesis, put forward by people like Gary Marcus, that scaling alone as the sole technique has run out of steam, and that the prevailing ML research paradigm needs to shift to a more hybrid approach of combining models. (The prevailing applied AI paradigm has of course always been that you usually need to combine models.)

Another way to explain my lack of surprise would be to say that Cicero is just a super-human board game playing engine that has been equipped with a voice synthesizer. But I might be downplaying the achievement here.

this is among the worser things you could be researching [...] There are... uh, not many realistic, beneficial applications for this work.

I have not read any of the authors' or Meta's messaging around this, so I am not sure if they make that point, but the sub-components of Cicero that somewhat competently and 'honestly' explain its currently intended moves seem to have beneficial applications too, if they were combined with an engine which is different from a game engine that absolutely wants to win and that can change its mind about moves to play later. This is a dual-use technology with both good and bad possible uses.

That being said, I agree that this is yet another regulatory wake-up call, if we would need one. As a group, AI researchers will not conveniently regulate themselves: they will move forward in creating more advanced dual-use technology, while openly acknowledging (see annex A.3 of the paper) that this technology might be used for both good and bad purposes downstream. So it is up to the rest of the world to make sure that these downstream uses are regulated.

Comment by Koen.Holtman on Ways to buy time · 2022-11-18T21:54:26.344Z · LW · GW

I am an AI/AGI alignment researcher. I do not feel very optimistic about the effectiveness of your proposed interventions, mainly because I do not buy your underlying risk model and solution model. Overall I am getting a vibe that you believe AGI will be invented soon, which is a valid assumption for planning specific actions, but then things get more weird in your solution model. To give one specific example of this:

It will largely be the responsibility of safety and governance teams to push labs to not publish papers that differentially-advance-capabilities, maintain strong information security, invest in alignment research, use alignment strategies, and not deploy potentially dangerous models.

There is an underlying assumption in the above reasoning, and in many of your other slowdown proposals, that the AI labs themselves will have significant influence on how their AI innovations will be used by downstream actors. You are assuming that they can prevent downstream actors from creating misaligned AI/AGI by not publishing certain research and not releasing foundation models with certain capabilities.

This underlying assumption, one where the labs or individual ML researchers have significant choke-point power that can lower x-risk, is entirely wrong. To unpack this statement a bit more: current advanced AI, including foundation models, is a dual-use technology that can be configured to do good as well as evil, that has the potential to be deployed by actors who will be very careful about it, and other actors who will be very careless. Also, we have seen that if one lab withholds its latest model, another party will quickly open-source an equally good model. Maybe real AGI, if it ever gets invented, will be a technology with an entirely different nature, but I am not going to bet on it.

More generally: I am seeing you make a mistake that I have seen a whole crowd of influencers and community builders make. You are following the crowd, and the crowd focuses too much on the idea that they need to convince 'researchers in top AI labs' and other 'ML researchers' in 'top conferences' about certain dangers:

  • The crowd focuses on influencing AI research labs and ML researchers without considering if these parties have the technical or organisational/political power to control how downstream users will use AI or future AGI. In general, they do not have this power to control. If you are really worried about an AI lab inventing an AGI soon (personally I am not, but for the sake of the argument), you will need to focus on its management, not on its researchers.

  • The crowd focuses on influencing ML researchers without considering if these parties even have the technical skills or attitude needed to be good technical alignment researchers. Often, they do not. (I expand on this topic here. The TL;DR: treating the management of the impact of advances in ML on society as an ML research problem makes about as much sense as our forefathers treating the management of the impact of the steam engine on society as a steam engine engineering problem. For the long version, see the paper linked to the post.)

Overall, when it comes to putting more manpower into outreach, I feel that safety awareness outreach to downstream users, and those who might regulate their actions via laws, moral persuasion, or product release decisions, is far more important.

Comment by Koen.Holtman on Don't design agents which exploit adversarial inputs · 2022-11-18T16:51:04.507Z · LW · GW

Consider two common alignment design patterns: [...] (2) Fixing a utility function and then argmaxing over all possible plans.

Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the 'standard model' of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing something different that is not yet in the bog-standard agent design.
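To make the 'standard model' concrete, here is a minimal sketch of fixing a utility function and then argmaxing over all possible plans. The one-dimensional toy world, the action names, the horizon, and the particular utility function are all hypothetical choices made for illustration.

```python
from itertools import product

ACTIONS = ("left", "right")
HORIZON = 3  # assumed plan length


def simulate(plan, start=0):
    # Deterministic toy world model: the agent walks along a line.
    position = start
    trajectory = [position]
    for action in plan:
        position += 1 if action == "right" else -1
        trajectory.append(position)
    return trajectory


def utility(trajectory):
    # Fixed utility function: 1 unit for every time step at position >= 2.
    return sum(1.0 for position in trajectory if position >= 2)


def best_plan():
    # Brute-force argmax over every possible plan of length HORIZON.
    all_plans = product(ACTIONS, repeat=HORIZON)
    return max(all_plans, key=lambda plan: utility(simulate(plan)))


print(best_plan())
```

Nothing in this loop is specific to alignment: it is the bare agent design, and an alignment design pattern would have to add or change something on top of it.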

I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.

Though your use of the word doomed sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you change the terminology from alignment design pattern to agent design pattern.

Comment by Koen.Holtman on I (with the help of a few more people) am planning to create an introduction to AI Safety that a smart teenager can understand. What am I missing? · 2022-11-16T16:57:17.075Z · LW · GW

My quick take here is that your list of topics is not an introduction to AI Safety, it is an introduction to AI safety as seen from inside the MIRI/Yudkowsky bubble, where everything is hard, and nobody is making any progress. Some more diversity in viewpoints would be better.

For your audience, my go-to source would be to cover bits of Christian's The Alignment Problem.

Comment by Koen.Holtman on Applying superintelligence without collusion · 2022-11-14T22:52:17.765Z · LW · GW

At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.

I agree with you that this part of the AGI x-risk solution space, the part where one tries to design measures to lower the probability of collusion between AGIs, is very under-explored. However, I do not believe that the root cause of this lack of attention is a widely held 'tacit belief' that robust AGI collusion is inevitable.

It is easy to imagine the existence of a very intelligent person who nevertheless hates colluding with other people. It is easy to imagine the existence of an AI which approximately maximises a reward function which has a term in it that penalises collusion. So why is nobody working on creating or improving such an AI or penalty term?
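As a rough sketch of what a reward function with a collusion penalty term could look like: the policy names, reward numbers, and the `collusion_score` measure below are entirely hypothetical. In particular, the sketch assumes that collusion can be measured at all, which is exactly the part that would need real research.

```python
def penalised_reward(task_reward, collusion_score, penalty_weight):
    # Reward = task performance minus a weighted collusion penalty term.
    return task_reward - penalty_weight * collusion_score


# Hypothetical candidate policies with made-up scores.
candidate_policies = {
    "work alone": {"task_reward": 5.0, "collusion_score": 0.0},
    "collude with other AI": {"task_reward": 8.0, "collusion_score": 1.0},
}


def choose(policies, penalty_weight):
    # The agent picks the policy that maximises the penalised reward.
    return max(
        policies,
        key=lambda name: penalised_reward(
            policies[name]["task_reward"],
            policies[name]["collusion_score"],
            penalty_weight,
        ),
    )


print(choose(candidate_policies, penalty_weight=0.0))
print(choose(candidate_policies, penalty_weight=10.0))
```

With a zero penalty weight the maximiser prefers to collude; with a large enough weight it prefers to work alone.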

My current x-risk community model is that the forces that channel people away from under-explored parts of the solution space have nothing to do with tacit assumptions about impossibilities. These forces operate at a much more pre-rational level of human psychology. Specifically: if there is no critical mass of people working in some part of the solution space already, then human social instincts will push most people away from starting to work there, because working there will necessarily be a very lonely affair. On a more rational level, the critical mass consideration is that if you want to do work that gets you engagement on your blog post, or citations on your academic paper, the best strategy is to pick a part of the solution space that already has some people working in it.

TL;DR: if you want to encourage people to explore an under-visited part of the solution space, you are not primarily fighting against a tacit belief that this part of the space will be empty of solutions. Instead, you will need to win the fight against the belief that people will be lonely when they go into that part of the space.

Comment by Koen.Holtman on I there a demo of "You can't fetch the coffee if you're dead"? · 2022-11-11T17:00:29.459Z · LW · GW

Like Charlie said, there is a demonstration in AI Safety Gridworlds. I also cover these dynamics in a more general and game-theoretical sense in my AGI Agent Safety by Iteratively Improving the Utility Function: this paper also has running code behind it, and it formalises the setup as a two-player/two-agent game.

In general though, if people do not buy the "You can't fetch the coffee if you're dead" problem as a thought experiment, then I am not sure if any running code based demo can change their mind.

I have been constructing a set of thought experiments, illustrated with grid worlds, that do not just demo the off-switch problem, but that also demo a solution to it. The whole setup intends to clarify what is really going on here, in a way that makes intuitive sense to a non-mathematical audience. Have not published these thought experiments yet in writing, only gave a talk about it. In theory, somebody could convert the grid world pictures in this talk into running code. If you want to learn more please contact me -- I can walk you through my talk slide deck.

I think I disagree with Charlie's hot take because Charlie seems to be assuming that the essence of the solution to "You can't fetch the coffee if you're dead" must be too complicated to show in a grid world. In fact, for the class of solutions I prefer, these solutions can be very easily shown in a grid world. Or at least easy in retrospect.

Comment by Koen.Holtman on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-03T18:22:53.575Z · LW · GW

It depends. But yes, incorrect epistemics can make an AGI safer, if it is the right and carefully calibrated kind of incorrect. A goal-directed AGI that incorrectly believes that its off switch does not work will be less resistant to people using it. So the goal here is to design an AGI epistemics that is the right kind of incorrect.

Note: designing an AGI epistemics that is the right kind of incorrect seems to go against a lot of the principles that aspiring rationalists seem to hold dear, but I am not an aspiring rationalist. For more technical info on such designs, you can look up my sequence on counterfactual planning.
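A toy expected-utility calculation that illustrates the off-switch point above. This is not the design from the counterfactual planning sequence; the action names, the probability that people press the switch, the task reward, and the cost of disabling the switch are all made-up numbers.

```python
def expected_utility(action, believed_switch_works,
                     p_press=0.5, task_reward=10.0, disable_cost=1.0):
    # Expected utility as computed inside the agent's own world model.
    if action == "disable switch":
        # Pays a cost up front, then completes the task for certain.
        return task_reward - disable_cost
    # action == "leave switch alone"
    # The agent only expects a shutdown if it believes the switch works.
    p_shutdown = p_press if believed_switch_works else 0.0
    return (1.0 - p_shutdown) * task_reward


def choose_action(believed_switch_works):
    actions = ("leave switch alone", "disable switch")
    return max(actions,
               key=lambda a: expected_utility(a, believed_switch_works))


print(choose_action(believed_switch_works=True))
print(choose_action(believed_switch_works=False))
```

The agent with the correct belief sees a reason to disable the switch, while the agent with the incorrect belief sees disabling it as a pure cost and leaves it alone.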

Comment by Koen.Holtman on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-03T18:12:45.075Z · LW · GW

Yes, a lot of it has been informed by economics. Some authors emphasize the relation, others de-emphasize it.

The relation goes beyond alignment and safety research. The way in which modern ML research defines its metric of AI agent intelligence is directly based on utility theory, which was developed by Von Neumann and Morgenstern to describe games and economic behaviour.

Comment by Koen.Holtman on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-03T18:01:25.710Z · LW · GW

Both explainable AI and interpretable AI are terms that are used with different meanings in different contexts. It really depends on the researcher what they mean by them.

Comment by Koen.Holtman on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-03T17:54:06.273Z · LW · GW

Decision theory is a term used in mathematical statistics and philosophy. In applied AI terms, a decision theory is the algorithm used by an AI agent to compute what action to take next. The nature of this algorithm is obviously relevant to alignment. That being said, philosophers like to argue among themselves about different decision theories and how they relate to certain paradoxes and limit cases, and they conduct these arguments using a terminology that is entirely disconnected from that being used in most theoretical and applied AI research. Not all AI alignment researchers believe that these philosophical arguments are very relevant to moving forward AI alignment research.

Comment by Koen.Holtman on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-03T16:59:54.110Z · LW · GW

It is definitely advisable to build a paper-clip maximiser that also needs to respect a whole bunch of additional stipulations about not harming people. The worry among many alignment researchers is that it might be very difficult to make these stipulations robust enough to deliver the level of safety we ideally want, especially in the case of AGIs that might get hugely intelligent or hugely powerful. As we are talking about not-yet-invented AGI technology, nobody really knows how easy or hard it will be to build robust-enough stipulations into it. It might be very easy in the end, but maybe not. Different researchers have different levels of optimism, but in the end nobody knows, and the conclusion remains the same no matter what the level of optimism is. The conclusion is to warn people about the risk and to do more alignment research with the aim to make it easier to build robust-enough stipulations into potential future AGIs.

Comment by Koen.Holtman on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-03T16:15:28.129Z · LW · GW

When one uses mathematics to clarify many AI alignment solutions, or even just to clarify Monte Carlo tree search as a decision making process, then the mathematical structures one finds can often best be interpreted as being mathematical counterfactuals, in the Pearl causal model sense. This explains the interest in counterfactual machine reasoning among many technical alignment researchers.

To explain this without using mathematics: say that we want to command a very powerful AGI agent to go about its duties while acting as if it cannot successfully bribe or threaten any human being. To find the best policy which respects this 'while acting as if' part of the command, the AGI will have to use counterfactual machine reasoning.
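As a toy illustration of the 'while acting as if' idea, the sketch below has the agent pick its action by planning in a counterfactual world model in which bribery never succeeds, rather than in its best model of the real world. This is only a hedged sketch: the models, action names, and numbers are hypothetical, and it is not the formal construction used in the counterfactual planning sequence.

```python
def plan(actions, model, utility):
    """Pick the action with the highest expected utility under the given world model."""
    def expected_utility(action):
        return sum(p * utility[o] for o, p in model[action].items())

    return max(actions, key=expected_utility)


actions = ["work", "bribe"]
utility = {"paid": 1.0, "big_payoff": 10.0, "nothing": 0.0}

# Hypothetical best-guess model of the real world: bribery usually works.
real_model = {
    "work": {"paid": 1.0},
    "bribe": {"big_payoff": 0.9, "nothing": 0.1},
}

# Counterfactual model: identical, except that bribery never succeeds.
counterfactual_model = dict(real_model, bribe={"nothing": 1.0})

naive_choice = plan(actions, real_model, utility)           # "bribe"
safer_choice = plan(actions, counterfactual_model, utility)  # "work"
```

The agent then executes `safer_choice` in the real world, even though its real-world model says bribery would pay off.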

Comment by Koen.Holtman on Clarifying AI X-risk · 2022-11-03T15:18:14.998Z · LW · GW

I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.

Me too, but note how the analysis leading to the conclusion above is very open about excluding a huge number of failure modes leading to x-risk from consideration first:

[...] our focus here was on the most popular writings on threat models in which the main source of risk is technical, rather than through poor decisions made by humans in how to use AI.

In this context, I of course have to observe that any human decision to deploy an AGI agent that uses purely consequentialist planning towards maximising a simple metric would be a very poor human decision to make indeed. But there are plenty of other poor decisions too that we need to worry about.

Comment by Koen.Holtman on Intent alignment should not be the goal for AGI x-risk reduction · 2022-11-01T12:32:26.714Z · LW · GW

To minimize P(misalignment x-risk | AGI) we should work on technical solutions to societal-AGI alignment, which is where AIs internalize a distilled and routinely updated constellation of shared values as determined by deliberative democratic processes driven entirely by humans

I agree that this kind of work is massively overlooked by this community. I have done some investigations on the root causes of why it is overlooked. The TL;DR is that this work is less technically interesting, and that many technical people here (and in industry and academia) would like to avoid even thinking about any work that needs to triangulate between different stakeholders who might then get mad at them. For a longer version of this analysis, see my paper Demanding and Designing Aligned Cognitive Architectures, where I also make some specific recommendations.

My overall feeling is that the growth in the type of technical risk reduction research you are calling for will have to be driven mostly by 'demand pull' from society, by laws and regulators that ban certain unaligned uses of AI.

Comment by Koen.Holtman on "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability · 2022-11-01T10:55:43.867Z · LW · GW

But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell's approach, and this seems bizarre.

Data point: I consider myself to be part of the AI x-risk community, but like you am not very excited about mechanistic interpretability research in an x-risk context. I think there is somewhat of a filter bubble effect going on, where people who are more excited about interpretability post more on this forum.

Stuart Russell's approach is a broad agenda, and I am not on board with all parts of it, but I definitely read his provable safety slogan as a call for more attention to the design approach where certain AI properties (like safety and interpretability properties) are robustly created by construction.

There is an analogy with computer programming here: a deep neural net is like a computer program written by an amateur without any domain knowledge, one that was carefully tweaked to pass all tests in the test suite. Interpreting such a program might be very difficult. (There is also the small matter that the program might fail spectacularly when given inputs not present in the test suite.) The best way to create an actually interpretable program is to build it from the ground up with interpretability in mind.

What is notable here is that the CS/software engineering people who deal with provable safety properties have long ago rejected the idea that provable safety should be about proving safe an already-existing bunch of spaghetti code that has passed a test suite. The problem of interpreting or reverse engineering such code is not considered a very interesting or urgent one in CS. But this problem seems to be exactly what a section of the ML community has now embarked on. As an intellectual quest, it is interesting. As a safety engineering approach for high-risk system components, I feel it has very limited potential.

Comment by Koen.Holtman on So, geez there's a lot of AI content these days · 2022-10-20T09:47:01.648Z · LW · GW

First some background on me, then some thoughts.

I am an alignment researcher and I read LW and AF occasionally. I tend to focus more on reading academic papers, not the alignment blogosphere. I read LW and AF mostly to find links to academic papers I might otherwise overlook, and for the occasional long-form analysis blogpost that the writer(s) put several months into writing. I am not a rationalist.

What I am seeing on LW is that numerically, many of the AI posts are from newcomers to the alignment field, or from people who are just thinking about getting in. This is perfectly fine, because they need some place to post and potentially get their questions answered. I do not think that the cause of alignment would be improved by moving all of these AI newcomer posts out of LW and onto AF.

So if there is a concern that high-quality long-form rationalist content is being drowned out by all the AI talk, I suggest you create an AF-like sub-forum dedicated to rationalist thought.

The AF versions of posts are primarily meant to be a thing you can link to professionally without having to explain the context of a lot of weird, not-obviously-related topics that show up on LessWrong.

From where I am standing, professionally speaking, AF has plenty of way-too-weird AI alignment content on it. Any policy maker or card-carrying AI/ML researcher browsing AF will quickly conclude that it is a place where posters can venture far outside of their political or mainstream science Overton windows, without ever being shouted down or even frowned upon by the rest of the posters. It is also the most up-voted and commented-on posts that are often the least inside any Overton window. This is just a thing that has grown historically, there is definitely beauty and value in it, and it is definitely too late to change now. Too late also given that EY has now gone full prophet-of-doom.

What I am hearing is that some alignment newcomers who have spent a few months doing original research, and writing a paper on it, have trouble getting their post on their results promoted from LW to AF. This is a de-motivator which I feel limits the growth of the field, so I would not mind if the moderators of this site start using (and advertising that they are using) an automatic rule where, if it is clear that the post publishes alignment research results that took months of honest effort to produce, any author request to promote it to AF will be almost automatically granted, no matter what the moderators think about the quality of the work inside.

Comment by Koen.Holtman on AI alignment with humans... but with which humans? · 2022-09-15T20:30:04.026Z · LW · GW

You are welcome. Another answer to your question just occurred to me.

If you count AI fairness research as a sub-type of AI alignment research, then you can find a whole community of alignment researchers who talk quite a lot with each other about 'aligned with whom' in quite sophisticated ways. Reference: the main conference of this community is ACM FAccT.

In EA and on this forum, when people count the number of alignment researchers, they usually count dedicated x-risk alignment researchers only, and not the people working on fairness, or on the problem of making self-driving cars safer. There is a somewhat unexamined assumption in the AI x-risk community that fairness and self-driving car safety techniques are not very relevant to managing AI x-risk, both in the technical space and the policy space. The way my x-risk technical work is going, it is increasingly telling me that this unexamined assumption is entirely wrong.

On a lighter note:

ignoring those values means we won't actually achieve 'alignment' even when we think we have.

Well, as long as the 'we' you are talking about here is a group of people that still includes Eliezer Yudkowsky, then I can guarantee that 'we' are in no danger of ever collectively believing that we have achieved alignment.

Comment by Koen.Holtman on AI alignment with humans... but with which humans? · 2022-09-14T12:47:58.423Z · LW · GW

When AI alignment researchers talk about 'alignment', they often seem to have a mental model where either (1) there's a single relevant human user whose latent preferences the AI system should become aligned with (e.g. a self-driving car with a single passenger); or (2) there's all 7.8 billion humans that the AI system should be aligned with, so it doesn't impose global catastrophic risks.


So, I'm left wondering what AI safety researchers are really talking about when they talk about 'alignment'.

The simple answer here is that many technical AI safety researchers on this forum talk exclusively about (1) and (2) so that they can avoid confronting all of the difficult socio-political issues you mention. Many of them avoid it specifically because they believe they would not be very good at politics anyway.

This is of course a shame, because the cases between (1) and (2) have a level of complexity that also needs to be investigated. I am a technical AI safety researcher who is increasingly moving into the space between (1) and (2), in part also because I consider (1) and (2) to be more solved than many other AI safety researchers on this forum like to believe.

This then has me talking about alignment with locally applicable social contracts, and about the technology of how such social contracts can be encoded into an AI. See for example the intro post and paper here.

Comment by Koen.Holtman on Benchmark for successful concept extrapolation/avoiding goal misgeneralization · 2022-07-25T08:43:49.965Z · LW · GW

this is something you would use on top of a model trained and monitored by engineers with domain knowledge.

OK, that is a good way to frame it.

Comment by Koen.Holtman on Benchmark for successful concept extrapolation/avoiding goal misgeneralization · 2022-07-07T20:55:14.928Z · LW · GW

I guess I should make another general remark here.

Yes, using implicit knowledge in your solution would be considered cheating, and bad form, when passing AI system benchmarks which intend to test more generic capabilities.

However, if I were to buy an alignment solution from a startup, then I would prefer to be told that the solution encodes a lot of relevant implicit knowledge about the problem domain. Incorporating such knowledge would no longer be cheating, it would be an expected part of safety engineering.

This seeming contradiction is of course one of these things that makes AI safety engineering so interesting as a field.

Comment by Koen.Holtman on Benchmark for successful concept extrapolation/avoiding goal misgeneralization · 2022-07-07T20:39:47.081Z · LW · GW

Interesting. Some high-level thoughts:

When reading your definition of concept extrapolation as it appears here:

Concept extrapolation is the skill of taking a concept, a feature, or a goal that is defined in a narrow training situation... and extrapolating it safely to a more general situation.

this reads to me like the problem of Robustness to Distributional Change from Concrete Problems. This problem is also often known as out-of-distribution robustness, but note that Concrete Problems also considers solutions like the AI detecting that it is out-of-training distribution and then asking for supervisory input. I think you are also considering such approaches within the broader scope of your work.

To me, the above benchmark does not smell like being about out-of-distribution problems anymore, it reminds me more of the problem of unsupervised learning, specifically the problem of clustering unlabelled data into distinct groups.

One (general but naive) way to compute the two desired classifiers would be to first take the unlabelled dataset and use unsupervised learning to classify it into 4 distinct clusters. Then, use the labelled data to single out the two clusters that also appear in the labelled dataset, or at least the two clusters that appear most often. Then, construct the two classifiers as follows. Say that the two groups also in the labelled data are cluster A, whose members mostly have the label happy, and cluster B, whose members mostly have the label sad. Call the remaining clusters C and D. Then the two classifiers are (A and C = happy, B and D = sad) and (A and D = happy, B and C = sad). Note that this approach will not likely win any benchmark contest, as the initial clustering step fails to use some information that is available in the labelled dataset. I mention it mostly because it highlights a certain viewpoint on the problem.
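The construction step of this naive approach can be sketched as follows. The sketch assumes that some unsupervised method has already assigned every data point to one of 4 clusters; the function name, the cluster ids, and the example data are hypothetical. It only covers singling out clusters A and B and pairing them with C and D.

```python
from collections import Counter


def build_two_classifiers(all_clusters, labelled_clusters, labels):
    """Build the two candidate cluster-to-label classifiers.

    all_clusters: the 4 cluster ids found in the unlabelled data.
    labelled_clusters: cluster id assigned to each labelled example.
    labels: the 'happy'/'sad' label of each labelled example.
    """
    # The two clusters that appear most often in the labelled data.
    top_two = [c for c, _ in Counter(labelled_clusters).most_common(2)]

    def majority_label(cluster):
        votes = Counter(l for c, l in zip(labelled_clusters, labels) if c == cluster)
        return votes.most_common(1)[0][0]

    # A is the mostly-happy cluster, B the mostly-sad one.
    # (Raises StopIteration if both clusters have the same majority label.)
    a = next(c for c in top_two if majority_label(c) == "happy")
    b = next(c for c in top_two if majority_label(c) == "sad")
    c_id, d_id = sorted(set(all_clusters) - {a, b})

    first = {a: "happy", c_id: "happy", b: "sad", d_id: "sad"}
    second = {a: "happy", d_id: "happy", b: "sad", c_id: "sad"}
    return first, second


# Hypothetical example: clusters 0 and 2 dominate the labelled data.
first, second = build_two_classifiers(
    all_clusters=[0, 1, 2, 3],
    labelled_clusters=[0, 0, 0, 2, 2, 1],
    labels=["happy", "happy", "sad", "sad", "sad", "happy"],
)
```

A new data point is then classified by first finding its cluster and then looking up that cluster in `first` or `second`.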

For better benchmark results, you need a more specialised clustering algorithm (this type is usually called Semi-Supervised Clustering I believe) that can exploit the fact that the labelled dataset gives you some prior information on the shapes of two of the clusters you want.

One might also argue that, if the above general unsupervised clustering based method does not give good benchmark results, then this is a sign that, to be prepared for every possible model split, you will need more than just two classifiers.

Comment by Koen.Holtman on Principles for Alignment/Agency Projects · 2022-07-07T18:19:35.799Z · LW · GW

Not sure what makes you think 'strawmen' at 2, but I can try to unpack this more for you.

Many warnings about unaligned AI start with the observation that it is a very bad idea to put some naively constructed reward function, like 'maximize paper clip production', into a sufficiently powerful AI. Nowadays on this forum, this is often called the 'outer alignment' problem. If you are truly worried about this problem and its impact on human survival, then it follows that you should be interested in doing the Hard Thing of helping people all over the world write less naively constructed reward functions to put into their future AIs.

John writes:

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things. [...] The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI [...]

This pattern of outsourcing the Hard Part to the AI is definitely on display when it comes to 2 above. Academic AI/ML research also tends to ignore this Hard Part entirely, and implicitly outsources it to applied AI researchers, or even to the end users.

Comment by Koen.Holtman on Principles for Alignment/Agency Projects · 2022-07-07T13:22:59.411Z · LW · GW

I generally agree with you on the principle Tackle the Hamming Problems, Don't Avoid Them.

That being said, some of the Hamming problems I see that are being avoided most on this forum, and in the AI alignment community, are

  1. Do something that will affect policy in a positive way

  2. Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions

Comment by Koen.Holtman on Looking back on my alignment PhD · 2022-07-07T11:50:43.223Z · LW · GW

I have said nice things about AUP in the past (in past papers I wrote) and I will continue to say them. I can definitely see real-life cases where adding an AUP term to a reward function makes the resulting AI or AGI more aligned. Therefore, I see AUP as a useful and welcome tool in the AI alignment/safety toolbox. Sure, this tool alone does not solve every problem, but that hardly makes it a pointless tool.

From your off-the-cuff remarks, I am guessing that you are currently inhabiting the strange place where 'pivotal acts' are your preferred alignment solution. I will grant that, if you are in that place, then AUP might appear more pointless to you than it does to me.

Comment by Koen.Holtman on AGI Ruin: A List of Lethalities · 2022-06-08T23:40:54.106Z · LW · GW

IMO the biggest hole here is "why should a superhuman AI be extremely consequentialist/optimizing"?

I agree this is a very big hole. My opinion here is not humble. My considered opinion is that Eliezer is deeply wrong in point 23, on many levels. (Edited to add: I guess I should include an informative link instead of just expressing my disappointment. Here is my 2021 review of the state of the corrigibility field).

Steven, in response to your line of reasoning to fix/clarify this point 23: I am not arguing for pivotal acts as considered and then rejected by Eliezer, but I believe that he strongly underestimates the chances of people inventing safe and also non-consequentialist optimising AGI. So I disagree with your plausibility claim in point (3).

Comment by Koen.Holtman on AGI Ruin: A List of Lethalities · 2022-06-08T22:01:14.858Z · LW · GW

You are welcome. I carefully avoided mentioning my credentials as a rhetorical device.

I rank the credibility of my own informed guesses far above those of Eliezer.

This is to highlight the essence of how many of the arguments on this site work.

Comment by Koen.Holtman on AGI Ruin: A List of Lethalities · 2022-06-07T12:26:47.825Z · LW · GW

Why do you rate yourself "far above" someone who has spent decades working in this field?

Well put, valid question. By the way, did you notice how careful I was in avoiding any direct mention of my own credentials above?

I see that Rob has already written a reply to your comments, making some of the broader points that I could have made too. So I'll cover some other things.

To answer your valid question: If you hover over my LW/AF username, you can see that I self-code as the kind of alignment researcher who is also a card-carrying member of the academic/industrial establishment. In both age and academic credentials, I am in fact a more senior researcher than Eliezer is. So the epistemology, if you are outside of this field and want to decide which one of us is probably more right, gets rather complicated.

Though we have disagreements, I should also point out some similarities between Eliezer and me.

Like Eliezer, I spend a lot of time reflecting on the problem of crafting tools that other people might use to improve their own ability to think about alignment. Specifically, these are not tools that can be used for the problem of triangulating between self-declared experts. They are tools that can be used by people to develop their own well-founded opinions independently. You may have noticed that this is somewhat of a theme in section C of the original post above.

The tools I have crafted so far are somewhat different from those that Eliezer is most famous for. I also tend to target my tools more at the mainstream than at Rationalists and EAs reading this forum.

Like Eliezer, on some bad days I cannot escape having certain feelings of disappointment about how well this entire global tool crafting project has been going so far. Eliezer seems to be having quite a lot of these bad days recently, which makes me feel sorry, but there you go.

Comment by Koen.Holtman on AGI Ruin: A List of Lethalities · 2022-06-06T21:04:01.613Z · LW · GW

Having read the original post and many of the comments made so far, I'll add an epistemological observation that I have not seen others make yet quite so forcefully. From the original post:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable [...]

I want to highlight that many of the different 'true things' on the long numbered list in the OP are in fact purely speculative claims about the probable nature of future AGI technology, a technology nobody has seen yet.

The claimed truth of several of these 'true things' is often backed up by nothing more than Eliezer's best-guess informed-gut-feeling predictions about what future AGI must necessarily be like. These predictions often directly contradict the best-guess informed-gut-feeling predictions of others, as is admirably demonstrated in the 2021 MIRI conversations.

Some of Eliezer's best guesses also directly contradict my own best-guess informed-gut-feeling predictions. I rank the credibility of my own informed guesses far above those of Eliezer.

So overall, based on my own best guesses here, I am much more optimistic about avoiding AGI ruin than Eliezer is. I am also much less dissatisfied about how much progress has been made so far.

Comment by Koen.Holtman on AGI Ruin: A List of Lethalities · 2022-06-06T15:12:07.533Z · LW · GW

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555, if you ever had AGI technology, and what you can do with that in terms of safety.

I can honestly say however that the project of writing that thing, in a way that makes the math somewhat accessible, was not easy.

Comment by Koen.Holtman on Announcing the Alignment of Complex Systems Research Group · 2022-06-05T11:20:32.112Z · LW · GW

If you’re interested in conceptual work on agency and the intersection of complex systems and AI alignment

I'm interested in this agenda, and I have been working on this kind of thing myself, but I am not interested at this time in moving to Prague. I figure that you are looking for people interested in moving to Prague, but if you are issuing a broad call for collaborators in general, or are thinking about setting up a much more distributed group, please clarify.

A more technical question about your approach:

What we’re looking for is more like a vertical game theory.

I'm not sure if you are interested in developing very generic kinds of vertical game theory, or in very specific acts of vertical mechanism design.

I feel that vertical mechanism design where some of the players are AIs is deeply interesting and relevant to alignment, much more so than generic game theory. For some examples of the kind of mechanism design I am talking about, see my post and related paper here. I am not sure if my interests make me a nearest neighbour of your research agenda, or just a very distant neighbour.

Comment by Koen.Holtman on Reshaping the AI Industry · 2022-06-03T14:20:42.048Z · LW · GW

There are some good thoughts here. I like this enough that I am going to comment on the effective strategies angle. You state that

The wider AI research community is an almost-optimal engine of apocalypse.


AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace.

I have to observe that, even though certain people on this forum definitely do believe the above two statements, even on this forum this extreme level of pessimism is a minority opinion. Personally, I have been quite pleased with the pace of progress in alignment research.

This level of disagreement, which is almost inevitable as it involves estimates about the future, has important implications for the problem of convincing people:

As per above, we'd be fighting an uphill battle here. Researchers and managers are knowledgeable on the subject, have undoubtedly heard about AI risk already, and weren't convinced.

I'd say that you would indeed be facing an uphill battle, if you'd want to convince most researchers and managers that the recent late-stage Yudkowsky estimates about the inevitability of an AI apocalypse are correct.

The effective framing you are looking for, even if you believe yourself that Yudkowsky is fully correct, is that more work is needed on reducing long-term AI risks. Researchers and managers in the AI industry might agree with you on that, even if they disagree with you and Yudkowsky about other things.

Whether these researchers and managers will change their whole career just because they agree with you is a different matter. Most will not. This is a separate problem, and should be treated as such. Trying to solve both problems at once by making people deeply afraid about the AI apocalypse is a losing strategy.

Comment by Koen.Holtman on Would (myopic) general public good producers significantly accelerate the development of AGI? · 2022-03-10T13:25:10.818Z · LW · GW

What are some of those [under-produced software] components? We can put them on a list.

Good question. I don't have a list, just a general sense of the situation. Making a list would be a research project in itself. Also, different people here would give you different answers. That being said,

  • I occasionally see comments from alignment research orgs who do actual software experiments that they spend a lot of time on just building and maintaining the infrastructure to run large scale experiments. You'd have to talk to actual orgs to ask them what they would need most. I'm currently a more theoretical alignment researcher, so I cannot offer up-to-date actionable insights here.

  • As a theoretical researcher, I do reflect on what useful roads are not being taken, by industry and academia. One observation here is that there is an under-investment in public high-quality datasets for testing and training, and in the (publicly available) tools needed for dataset preparation and quality assurance. I am not the only one making that observation, see for example . Another observation is that everybody is working on open source ML algorithms, but almost nobody is working on open source reward functions that try to capture the actual complex details of human needs, laws, or morality. Also, where is the open source aligned content recommender?

  • On a more practical note, AI benchmarks have turned out to be a good mechanism for drawing attention to certain problems. Many feel that these benchmarks are having a bad influence on the field of AI. I have a lot of sympathy for that view, but you might also go with the flow. A (crypto) market that rewards progress on selected alignment benchmarks may be a thing that has value. You can think here of benchmarks that reward cooperative behaviour, truthfulness and morality in answers given by natural language querying systems, playing games ethically ( ), etc. My preference would be to reward benchmark contributions that win by building strong priors into the AI to guide and channel machine learning; many ML researchers would consider this to be cheating, but these are supposed to be alignment benchmarks, not machine-learning-from-blank-slate benchmarks. I have some doubts about the benchmarks for fairness in ML which are becoming popular, if I look at the latest NeurIPS: the ones I have seen offer tests which look a bit too easy, if the objective is to reward progress on techniques that have the promise of scaling up to the more complex notions of fairness and morality you would like to have at the AGI level, or even for something like a simple content recommendation AI. Some cooperative behaviour benchmarks also strike me as being too simple, in their problem statements and mechanics, to reward the type of research that I would like to see. Generally, you would want to retire a benchmark from the rewards-generating market when the improvements on the score level out.