Posts

What does davidad want from «boundaries»? 2024-02-06T17:45:42.348Z
Does davidad's uploading moonshot work? 2023-11-03T02:21:51.720Z
A list of core AI safety problems and how I hope to solve them 2023-08-26T15:12:18.484Z
Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs 2023-07-22T18:09:03.816Z
An Open Agency Architecture for Safe Transformative AI 2022-12-20T13:04:06.409Z
AI Neorealism: a threat model & success criterion for existential safety 2022-12-15T13:42:11.072Z
Side-channels: input versus output 2022-12-12T12:32:29.042Z
Reframing inner alignment 2022-12-11T13:53:23.195Z
You can still fetch the coffee today if you're dead tomorrow 2022-12-09T14:06:48.442Z
Cryptoepistemology 2022-02-24T20:34:32.790Z
The Promise and Peril of Finite Sets 2021-12-10T12:29:56.535Z
davidad's Shortform 2021-12-09T18:16:57.484Z
Why I Moved from AI to Neuroscience, or: Uploading Worms 2012-04-13T07:10:32.377Z

Comments

Comment by davidad on AI Assistants Should Have a Direct Line to Their Developers · 2024-12-28T17:26:14.546Z · LW · GW

Note however that having more powerful internal-only models “analyzing patterns” across multiple conversations, and in a position to effect change (especially by intervening on individual conversations while retaining long-term memories), would worsen the potential for AI systems to carry out coordinated scheming campaigns.

This could be mitigated by combining it with privacy-preservation architectures such as Anthropic’s existing work on Clio.

Comment by davidad on Proveably Safe Self Driving Cars [Modulo Assumptions] · 2024-10-04T14:18:19.914Z · LW · GW

Yes, I am indeed thinking about this.

  1. Let’s first consider the easiest case, in which we ask for a solution/output to be not merely provably correct according to some spec, but provably unique. Then there is clearly no space for steganography.
  2. It is a little bit harder if the solution has some symmetries, like permutations of the order in which things are serialized. For this we can use sorting and other normalization techniques (converting an output to a unique representative of its equivalence class).
  3. If normalization is hard, we can set up the box to be interactive, so that users cannot see the solution, but can only run queries on it (e.g. “next action please”) and the user must prove that the queries are invariant under (aka “respect”) the equivalence relation.
  4. But for most real-world problems there are many equally specification-satisfying strategies that are not behaviourally equivalent. The next hardest thing to do is to establish an upper bound on the log-cardinality of the set of behavioural-equivalence-classes that satisfy the specification. We want to keep the channel capacity (in bits per day) available for steganography, whether to humans, insecure cyber-infrastructure, or biosequence synthesis, below a very low threshold (maybe 2 bits per day is okay).
  5. Often there will be too many degrees of freedom for this, so instead we need to do entropy regularization, which can be done elegantly with GFlowNets. I have not worked out all the details here but the hope is that we can derive a sound bound or PAC estimate of the KL divergence of the observable behaviour (which is sampled from the policy distribution dependent upon a stream of certified randomness) from the Fréchet mean/centroid of all correct behaviours. This quantity would provide an upper bound on how much information the optimization process inside the box can transmit to the outside by selecting a biased policy distribution.
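
Items 2 and 4 in the list above can be illustrated with a minimal Python sketch. It assumes a toy setting in which a solution is a collection of tasks whose serialization order carries no meaning. The names `canonicalize` and `max_covert_bits` are hypothetical helpers introduced only for this illustration.

```python
import math
from itertools import permutations

def canonicalize(solution):
    """Map a solution to the unique representative of its equivalence class.

    Assumption: two solutions are equivalent iff they contain the same items
    in any order, so sorting removes the ordering as a covert channel.
    """
    return tuple(sorted(solution))

def max_covert_bits(num_equivalence_classes):
    """Upper bound (in bits) on what one choice among the classes can encode."""
    return math.log2(num_equivalence_classes)

# All orderings of the same three tasks are distinct serializations...
orderings = set(permutations(["weld", "paint", "inspect"]))
# ...but they collapse to a single canonical representative.
canonical = {canonicalize(o) for o in orderings}

bits_without_normalization = max_covert_bits(len(orderings))
bits_with_normalization = max_covert_bits(len(canonical))
```

In this toy case the unnormalized output could carry about 2.58 bits (the log of 6 orderings), while the normalized output carries none. Real behavioural-equivalence classes would need a domain-specific equivalence relation rather than plain sorting.
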
Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2024-09-15T14:40:43.674Z · LW · GW

Nice, thanks for the pointer!

Comment by davidad on Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems · 2024-05-20T16:53:30.813Z · LW · GW

Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done in a different way than I hope. Locking in to a particular path massively reduces the entropy of the outcome distribution beyond what is necessary to ensure a reasonable risk threshold (e.g. 1 catastrophic event per millennium) is cleared. A FEEF objective (namely, minimize the divergence of the outcomes conditional on intervention from the outcomes conditional on filtering for the goal being met) would greatly penalize the additional facts which are enforced by the lock-in behaviours.

As a fail-safe, I propose to mitigate the downsides of lock-in by using time-bounded utility functions.
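
The way a divergence-based objective penalizes lock-in can be sketched numerically. The following is a toy illustration only: it assumes discrete outcomes, a made-up prior, and the direction KL(outcomes under intervention || outcomes filtered on the goal), none of which are specified in the comment above.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits for discrete distributions given as dicts."""
    total = 0.0
    for outcome, p_mass in p.items():
        if p_mass == 0.0:
            continue
        q_mass = q.get(outcome, 0.0)
        if q_mass == 0.0:
            return math.inf
        total += p_mass * math.log2(p_mass / q_mass)
    return total

# Hypothetical prior over outcomes; only a, b, c, d satisfy the goal.
prior = {"a": 0.125, "b": 0.125, "c": 0.125, "d": 0.125, "fail": 0.5}
goal = {"a", "b", "c", "d"}

# Outcomes conditional on filtering for the goal being met.
goal_mass = sum(m for o, m in prior.items() if o in goal)
filtered = {o: m / goal_mass for o, m in prior.items() if o in goal}

locked_in = {"a": 1.0}      # policy that forces one particular outcome
spread = dict(filtered)     # policy that reproduces the filtered distribution

lock_in_penalty = kl_divergence(locked_in, filtered)  # 2 bits
spread_penalty = kl_divergence(spread, filtered)      # 0 bits
```

Both policies meet the goal with certainty, but the locked-in policy pays 2 bits for the extra facts it enforces, while the policy that leaves the goal-satisfying outcomes as spread out as the filtered prior pays nothing.
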

Comment by davidad on Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems · 2024-05-20T16:37:47.749Z · LW · GW

It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).

It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details of the engineering designs (which it then proves satisfy the safety specifications). But, I think sufficient fine-tuning with a GFlowNet objective will naturally penalise description complexity, and also penalise heavily biased sampling of equally complex solutions (e.g. toward ones that encode messages of any significance), and I expect this to reduce this risk to an acceptable level. I would like to fund a sleeper-agents-style experiment on this by the end of 2025.

Comment by davidad on Linear infra-Bayesian Bandits · 2024-05-16T07:37:30.259Z · LW · GW

Re footnote 2, and the claim that the order matters, do you have a concrete example of a homogeneous ultradistribution that is affine in one sense but not the other?

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2024-05-03T05:40:47.632Z · LW · GW

The "random dictator" baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for "Pareto improvement" being "no superintelligence"). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.
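
A minimal sketch of this reading of the baseline, assuming a finite set of candidate outcomes with numeric utilities per stakeholder. The stakeholder names, option names, and utility numbers are hypothetical, and the weak form of Pareto improvement (nobody worse off) is used for brevity.

```python
import random

def pareto_improvements(options, baseline):
    """Options that leave no stakeholder worse off than the baseline."""
    return [
        name for name, utilities in options.items()
        if all(utilities[s] >= baseline[s] for s in baseline)
    ]

def random_dictator(options, baseline, rng):
    """Pick a stakeholder at random; they choose among Pareto improvements only."""
    allowed = pareto_improvements(options, baseline)
    if not allowed:
        return None  # nothing beats the baseline for everyone
    dictator = rng.choice(sorted(baseline))
    return max(allowed, key=lambda name: options[name][dictator])

baseline = {"zealot": 0, "heretic": 0}  # utilities under "no superintelligence"
options = {
    "hurt_heretics": {"zealot": 9, "heretic": -5},
    "shared_gains": {"zealot": 3, "heretic": 3},
    "zealot_bonus": {"zealot": 5, "heretic": 1},
}

choice = random_dictator(options, baseline, random.Random(0))
```

Whichever stakeholder is drawn, `hurt_heretics` is never selectable, because it is filtered out before the dictator gets to choose.
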

Comment by davidad on Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis · 2024-02-05T07:28:37.437Z · LW · GW

Yes. You will find more details in his paper with Steve Omohundro, Provably safe systems, in which I am listed in the acknowledgments (under my legal name, David Dalrymple).

Max and I also met and discussed the similarities in advance of the AI Safety Summit in Bletchley.

Comment by davidad on Uncertainty in all its flavours · 2024-01-12T18:16:19.077Z · LW · GW

I agree that each of and has two algebraically equivalent interpretations, as you say, where one is about inconsistency and the other is about inferiority for the adversary. (I hadn’t noticed that).

The variant still seems somewhat irregular to me; even though Diffractor does use it in Infra-Miscellanea Section 2, I wouldn’t select it as “the” infrabayesian monad. I’m also confused about which one you’re calling unbounded. It seems to me like the variant is bounded (on both sides) whereas the variant is bounded on one side, and neither is really unbounded. (Being bounded on at least one side is of course necessary for being consistent with infinite ethics.)

Comment by davidad on Agent membranes/boundaries and formalizing “safety” · 2024-01-07T00:33:24.294Z · LW · GW

These are very good questions. First, two general clarifications:

A. «Boundaries» are not partitions of physical space; they are partitions of a causal graphical model that is an abstraction over the concrete physical world-model.

B. To "pierce" a «boundary» is to counterfactually (with respect to the concrete physical world-model) cause the abstract model that represents the boundary to increase in prediction error (relative to the best augmented abstraction that uses the same state-space factorization but permits arbitrary causal dependencies crossing the boundary).

So, to your particular cases:

  1. Probably not. There is no fundamental difference between sound and contact. Rather, the fundamental difference is between the usual flow of information through the senses and other flows of information that are possible in the concrete physical world-model but not represented in the abstraction. An interaction that pierces the membrane is one which breaks the abstraction barrier of perception. Ordinary speech acts do not. Only sounds which cause damage (internal state changes that are not well-modelled as mental states) or which otherwise exceed the "operating conditions" in the state space of the «boundary» layer (e.g. certain kinds of superstimuli) would pierce the «boundary».
  2. Almost surely not. This is why, as an agenda for AI safety, it will be necessary to specify a handful of constructive goals, such as provision of clean water and sustenance and the maintenance of hospitable atmospheric conditions, in addition to the «boundary»-based safety prohibitions.
  3. Definitely not. Omission of beneficial actions is not a counterfactual impact.
  4. Probably. This causes prediction error because the abstraction of typical human spatial positions is that they have substantial ability to affect their position between nearby streets by simple locomotory action sequences. But if a human is already effectively imprisoned, then adding more concrete would not create additional/counterfactual prediction error.
  5. Probably not. Provision of resources (that are within "operating conditions", i.e. not "out-of-distribution") is not a «boundary» violation as long as the human has the typical amount of control of whether to accept them.
  6. Definitely not. Exploiting behavioural tendencies which are not counterfactually corrupted is not a «boundary» violation.
  7. Maybe. If the ad's effect on decision-making tendencies is well modelled by the abstraction of typical in-distribution human interactions, then using that channel does not violate the «boundary». Unprecedented superstimuli would, but the precedented patterns in advertising are already pretty bad. This is a weak point of the «boundaries» concept, in my view. We need additional criteria for avoiding psychological harm, including superpersuasion. One is simply to forbid autonomous superhuman systems from communicating to humans at all: any proposed actions which can be meaningfully interpreted by sandboxed human-level supervisory AIs as messages with nontrivial semantics could be rejected. Another approach is Mariven's criterion for deception, but applying this criterion requires modelling human mental states as beliefs about the world (which is certainly not 100% scientifically accurate). I would like to see more work here, and more different proposed approaches.
Comment by davidad on Safety First: safety before full alignment. The deontic sufficiency hypothesis. · 2024-01-06T23:49:07.290Z · LW · GW

For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

Comment by davidad on Uncertainty in all its flavours · 2024-01-03T04:09:03.188Z · LW · GW

Kosoy's infrabayesian monad  is given by 

There are a few different varieties of infrabayesian belief-state, but I currently favour the one which is called "homogeneous ultracontributions", which is "non-empty topologically-closed ⊥–closed convex sets of subdistributions", thus almost exactly the same as Mio-Sarkis-Vignudelli's "non-empty finitely-generated ⊥–closed convex sets of subdistributions monad" (Definition 36 of this paper), with the difference being essentially that it's presentable, but it's much more like  than .

I am not at all convinced by the interpretation of  here as terminating a game with a reward for the adversary or the agent. My interpretation of the distinguished element  in  is not that it represents a special state in which the game is over, but rather a special state in which there is a contradiction between some of one's assumptions/observations. This is very useful for modelling Bayesian updates (Evidential Decision Theory via Partial Markov Categories, sections 3.5-3.6), in which some variable  is observed to satisfy a certain predicate : this can be modelled by applying the predicate in the form  where  means the predicate is false, and   means it is true. But I don't think there is a dual to logical inconsistency, other than the full set of all possible subdistributions on the state space. It is certainly not the same type of "failure" as losing a game.

Comment by davidad on Uncertainty in all its flavours · 2024-01-03T03:28:29.722Z · LW · GW

Does this article have any practical significance, or is it all just abstract nonsense? How does this help us solve the Big Problem? To be perfectly frank, I have no idea. Timelines are probably too short for agent foundations, and this article is maybe agent foundations foundations...

I do think this is highly practically relevant, not least because using an infrabayesian monad instead of the distribution monad can provide the necessary kind of epistemic conservatism for practical safety verification in complex cyber-physical systems like the biosphere being protected and the cybersphere being monitored. It also helps remove instrumentally convergent perverse incentives to control everything.

Comment by davidad on Uncertainty in all its flavours · 2024-01-03T02:50:16.249Z · LW · GW

Meyer's

If this is David Jaz Myers, it should be "Myers' thesis", here and elsewhere

Comment by davidad on Does davidad's uploading moonshot work? · 2023-11-04T02:42:42.388Z · LW · GW

I have said many times that uploads created by any process I know of so far would probably be unable to learn or form memories. (I think it didn't come up in this particular dialogue, but in the unanswered questions section Jacob mentions having heard me say it in the past.)

Eliezer has also said that makes it useless in terms of decreasing x-risk. I don't have a strong inside view on this question one way or the other. I do think if Factored Cognition is true then "that subset of thinking is enough," but I have a lot of uncertainty about whether Factored Cognition is true.

Anyway, even if that subset of thinking is enough, and even if we could simulate all the true mechanisms of plasticity, then I still don't think this saves the world, personally, which is part of why I am not in fact pursuing uploading these days.

Comment by davidad on RSPs are pauses done right · 2023-10-14T18:32:45.936Z · LW · GW

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

Comment by davidad on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2023-09-18T17:20:03.527Z · LW · GW

I like the idea of trying out H-JEPA with GFlowNet actors.

I also like the idea of using LLM-based virtue ethics as a regularizer, although I would still want deontic guardrails that seem good enough to avoid catastrophe.

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-09-04T19:06:59.104Z · LW · GW

Yes, it's the latter. See also the Open Agency Keyholder Prize.

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-08-28T15:16:29.873Z · LW · GW

That’s basically correct. OAA is more like a research agenda and a story about how one would put the research outputs together to build safe AI, than an engineering agenda that humanity entirely knows how to build. Even I think it’s only about 30% likely to work in time.

I would love it if humanity had a plan that was more likely to be feasible, and in my opinion that’s still an open problem!

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-08-28T15:10:29.339Z · LW · GW

OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).

There is still a misuse version: someone could remove the provision in 5.1.5 that the model of Earth-like situations should be largely agnostic about human behavior, and instead build a detailed model of how human nervous systems respond to language. (Then, even though the superintelligence in the box would still be making only descriptive arguments about a policy, the policy that comes out would likely emit normative arguments at deployment time.) Superintelligence misuse is covered under problem 11.

If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-08-28T10:39:23.455Z · LW · GW

It is often considered as such, but my concern is less with “the alignment question” (how to build AI that values whatever its stakeholders value) and more with how to build transformative AI that probably does not lead to catastrophe. Misuse is one of the ways that it can lead to catastrophe. In fact, in practice, we have to sort misuse out sooner than accidents, because catastrophic misuses become viable at a lower tech level than catastrophic accidents.

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-08-28T00:49:17.314Z · LW · GW

That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in on the conclusion that the compositional world-modeling framework problem, as Owen Lynch and I have laid it out in this post, is potentially “the hard part” of AI safety, that would be wonderful…

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-08-28T00:39:10.628Z · LW · GW

I think you’re directionally correct; I agree about the following:

  • A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
  • I imagine these would be mostly if not entirely learned.
  • There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to justify confidence in the proposed policy network, in which case the policy would not exit the box, and the failure is a flop rather than a foom.)

Perhaps the keyphrase you’re missing is “interval MDP abstraction”. One specific paper that combines RL and model-checking and coarse-graining in the way you’re asking for is Formal Controller Synthesis for Continuous-Space MDPs via Model-Free Reinforcement Learning.
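
To give a flavour of why interval abstractions can yield sound bounds in finite time, here is a minimal sketch of the inner step of robust value iteration on an interval MDP. It is not taken from the cited paper; the function name, the two-state example, and the interval numbers are all hypothetical. The idea is that once transition probabilities are only known to lie in intervals, the worst case is found by pushing the free probability mass toward the lowest-value successors.

```python
def worst_case_expectation(intervals, values):
    """Lower bound on E[value] over all distributions within the intervals.

    intervals: {successor: (low, high)} with sum(low) <= 1 <= sum(high).
    values:    {successor: value of that abstract state}.
    """
    # Start every successor at its lower probability bound.
    probs = {s: low for s, (low, high) in intervals.items()}
    remaining = 1.0 - sum(probs.values())
    # Give the leftover mass to the lowest-value successors first.
    for s in sorted(intervals, key=lambda s: values[s]):
        low, high = intervals[s]
        extra = min(high - low, remaining)
        probs[s] += extra
        remaining -= extra
    return sum(probs[s] * values[s] for s in intervals)

# One abstract state-action pair with two coarse-grained successors.
intervals = {"safe": (0.5, 0.75), "unsafe": (0.25, 0.5)}
values = {"safe": 1.0, "unsafe": 0.0}

lower_bound = worst_case_expectation(intervals, values)  # 0.5
```

The resulting bound holds for every concrete transition distribution consistent with the intervals, which is the sense in which it is sound. Wider intervals are cheaper to certify but give looser bounds, matching the tradeoff between computing time and bound tightness.
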

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-08-27T06:21:14.279Z · LW · GW

Yes, the “shutdown timer” mechanism is part of the policy-scoring function that is used during policy optimization. OAA has multiple stages that could be considered “training”, and policy optimization is the one that is closest to the end, so I wouldn’t call it “the training stage”, but it certainly isn’t the deployment stage.

We hope not merely that the policy only cares about the short term, but also that it cares quite a lot about gracefully shutting itself down on time.

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-08-27T06:16:00.784Z · LW · GW

There’s something to be said for this, because with enough RLHF, GPT-4 does seem to have become pretty corrigible, especially compared to Bing Sydney. However, that corrigible persona is probably only superficial, and the larger and more capable a single Transformer gets, the more of its mesa-optimization power we can expect will be devoted to objectives which are uninfluenced by in-context corrections.

Comment by davidad on A list of core AI safety problems and how I hope to solve them · 2023-08-26T23:06:11.883Z · LW · GW

A system with a shutdown timer, in my sense, has no terms in its reward function which depend on what happens after the timer expires. (This is discussed in more detail in my previous post.) So there is no reason to persuade humans or do anything else to circumvent the timer, unless there is an inner alignment failure (maybe that’s what you mean by “deception instance”). Indeed, it is the formal verification that prevents inner alignment failures.
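
A minimal sketch of a return with no terms after the timer expires, assuming discrete time steps and a list of per-step rewards. The function name and the reward numbers are hypothetical.

```python
def time_bounded_return(rewards, timer_expiry):
    """Sum rewards for steps strictly before `timer_expiry`.

    Steps at or after the expiry contribute nothing, so the score cannot
    depend on what happens once the timer has run out.
    """
    return sum(r for t, r in enumerate(rewards) if t < timer_expiry)

# Two trajectories that differ only after the timer expires at step 3.
rewards_if_shut_down = [1.0, 2.0, 3.0, 0.0, 0.0]
rewards_if_timer_circumvented = [1.0, 2.0, 3.0, 50.0, 50.0]

score_shut_down = time_bounded_return(rewards_if_shut_down, timer_expiry=3)
score_circumvented = time_bounded_return(rewards_if_timer_circumvented, timer_expiry=3)
```

Both trajectories score 6.0, so under this objective there is nothing to gain from circumventing the timer. The sketch says nothing about inner alignment failures, which the comment assigns to formal verification.
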

Comment by davidad on Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs · 2023-07-23T08:29:07.774Z · LW · GW

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.
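
The additive, transitive accounting can be sketched as a walk up the fine-tuning lineage. The FLOP figures below are made-up placeholders, not values from the proposal, and `counted_compute` is a hypothetical helper name.

```python
def counted_compute(run, runs):
    """Total FLOP counted against the size limit for `run`, including all ancestors."""
    own_flop, parent = runs[run]
    if parent is None:
        return own_flop
    return own_flop + counted_compute(parent, runs)

# run name -> (FLOP used by this run alone, run that produced its starting model)
runs = {
    "X": (5 * 10**24, None),  # foundation model
    "Y": (1 * 10**23, "X"),   # finetune of X
    "Z": (2 * 10**22, "Y"),   # finetune of Y
}

total_for_z = counted_compute("Z", runs)
```

Here Training Run Z is charged for its own compute plus that of Y and X, so `total_for_z` is 5.12 * 10**24 FLOP even though Z itself used only 2 * 10**22.
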

Comment by davidad on Eight Strategies for Tackling the Hard Part of the Alignment Problem · 2023-07-22T19:15:25.601Z · LW · GW

Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

Comment by davidad on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2023-07-22T18:05:38.094Z · LW · GW

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.

Comment by davidad on Eight Strategies for Tackling the Hard Part of the Alignment Problem · 2023-07-22T18:02:04.720Z · LW · GW

I think formal verification belongs in the "requires knowing what failure looks like" category.

For example, in the VNN competition last year, some adversarial robustness properties were formally proven about VGG16. This requires white-box access to the weights, to be sure, but I don't think it requires understanding "how failure happens".

Comment by davidad on You can still fetch the coffee today if you're dead tomorrow · 2023-06-29T11:58:45.584Z · LW · GW

Yes—assuming that the pause interrupts any anticipatory gradient flows from the continuing agent back to the agent which is considering whether to pause.

This pattern is instantiated in the Open Agency Architecture twice:

  1. Step 2 generates top-level agents which are time-bounded at a moderate timescale (~days), with the deliberation about whether to redeploy a top-level agent being carried out by human operators.
  2. In Step 4, the top-level agent dispatches most tasks by deploying narrower low-level agents with much tighter time bounds, with the deliberation about whether to redeploy a low-level agent being automated by the top-level model.
Comment by davidad on Is Deontological AI Safe? [Feedback Draft] · 2023-06-12T17:43:04.804Z · LW · GW

For what it's worth, the phrase "night watchman" as I use it is certainly downstream of Nozick's concept.

Comment by davidad on Steering GPT-2-XL by adding an activation vector · 2023-05-17T18:33:58.084Z · LW · GW

Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.

Comment by davidad on Steering GPT-2-XL by adding an activation vector · 2023-05-15T13:25:33.149Z · LW · GW

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.
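
The contrast between the two kinds of vector can be made concrete with a toy sketch using plain Python lists in place of real tensors. All numbers and names here are hypothetical. In particular, the weight-space delta presupposes a fine-tuned checkpoint (which is where the backward passes are spent), whereas the activation-space vector is assumed to be a difference of activations from forward passes on two contrasting prompts.

```python
def task_vector(finetuned, base):
    """Weight-space task vector: fine-tuned weights minus base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def apply_task_vectors(base, vectors_with_coeffs):
    """Affine combination in weight-space: base + sum of coeff * task vector."""
    out = {name: list(w) for name, w in base.items()}
    for coeff, vec in vectors_with_coeffs:
        for name in out:
            out[name] = [w + coeff * d for w, d in zip(out[name], vec[name])]
    return out

def steer_activations(activations, steering_vector, coeff):
    """Activation-space steering: add a scaled vector to one layer's activations."""
    return [a + coeff * s for a, s in zip(activations, steering_vector)]

base = {"layer0": [1.0, 2.0]}
finetuned = {"layer0": [1.5, 1.0]}  # requires a training run to obtain
delta = task_vector(finetuned, base)
negated = apply_task_vectors(base, [(-1.0, delta)])  # subtract the capability

# Steering vector from activations on two contrasting prompts (no training).
act_positive, act_negative = [2.0, 0.0], [0.0, 2.0]
steering = [p - n for p, n in zip(act_positive, act_negative)]
steered = steer_activations([0.0, 1.0], steering, coeff=0.5)
```

The arithmetic is the same in both cases. The difference the comment points to is the cost of obtaining the vector in the first place.
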

Comment by davidad on «Boundaries» for formalizing an MVP morality · 2023-05-13T19:45:09.219Z · LW · GW

Thanks for bringing all of this together - I think this paints a fine picture of my current best hope for deontic sufficiency. If we can do better than that, great!

Comment by davidad on An Open Agency Architecture for Safe Transformative AI · 2023-04-21T21:08:07.480Z · LW · GW

I agree that we should start by trying this with far simpler worlds than our own, and with futarchy-style decision-making schemes, where forecasters produce extremely stylized QURI-style models that map from action-space to outcome-space while a broader group of stakeholders defines mappings from outcome-space to each stakeholder’s utility.

Comment by davidad on Why Are Maximum Entropy Distributions So Ubiquitous? · 2023-04-11T19:30:44.915Z · LW · GW

Every distribution (that agrees with the base measure about null sets) is a Boltzmann distribution. Simply define $E(x) := -\log p(x)$, where $p$ is the density with respect to the base measure, and presto, $p(x) = e^{-E(x)}$.
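
This can be checked numerically on a small discrete example. The sketch assumes unit inverse temperature and a counting base measure, with energies taken to be negative log-probabilities. The function names and the example distribution are hypothetical.

```python
import math

def energies_from_distribution(p):
    """Define E(x) = -log p(x); requires p(x) > 0 for every x."""
    return {x: -math.log(px) for x, px in p.items()}

def boltzmann(energies, beta=1.0):
    """p(x) = exp(-beta * E(x)) / Z, with Z the normalizing partition function."""
    weights = {x: math.exp(-beta * e) for x, e in energies.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

p = {"a": 0.5, "b": 0.3, "c": 0.2}
recovered = boltzmann(energies_from_distribution(p))
```

Up to floating-point error, `recovered` equals `p`, and the partition function comes out as 1 because the energies already encode the normalization.
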

This is a very useful/important/underrated fact, but it does somewhat trivialize “Boltzmann” and “maximum entropy” as classes of distributions, rather than as certain ways of looking at distributions.

A related important fact is that temperature is not really a physical quantity, but $1/T$ is: it’s known as inverse temperature or $\beta$. (The nonexistence of zero-temperature systems, the existence of negative-temperature systems, and the fact that negative-temperature systems intuitively seem extremely high energy bear this out.)

Comment by davidad on Practical Pitfalls of Causal Scrubbing · 2023-03-27T22:06:14.980Z · LW · GW

Note, assuming the test/validation distribution is an empirical dataset (i.e. a finite mixture of Dirac deltas), and the original graph is deterministic, the KL divergence of the pushforward distributions on the outputs of the computational graph will typically be infinite. In this context you would need to use a Wasserstein divergence, or to "thicken" the distributions by adding absolutely-continuous noise to the input and/or output.
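
A discrete toy version of this problem, as a sketch only: two empirical output distributions whose supports differ slightly have infinite KL divergence, and mixing in a little full-support noise makes it finite. Mixing with a uniform distribution over a fixed finite support is an assumption made here as a stand-in for adding absolutely-continuous noise, and the data points are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if p puts mass where q has none."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

def thicken(dist, support, eps):
    """Mix with a uniform distribution on `support` (a discrete stand-in for noise)."""
    return {x: (1 - eps) * dist.get(x, 0.0) + eps / len(support) for x in support}

# Output distributions of the original and ablated graphs on two data points.
original = {0.0: 0.5, 1.0: 0.5}
ablated = {0.0: 0.5, 1.01: 0.5}
support = [0.0, 1.0, 1.01]

raw = kl_divergence(original, ablated)
smoothed = kl_divergence(
    thicken(original, support, 0.01),
    thicken(ablated, support, 0.01),
)
```

Here `raw` is infinite even though the ablated output moved by only 0.01, while `smoothed` is finite. The smoothed value depends on the arbitrary noise level and ignores how close 1.0 and 1.01 are, which is part of the appeal of a Wasserstein-style distance in this setting.
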

Or maybe you meant in cases where the output is a softmax layer and interpreted as a probability distribution, in which case KL divergence does seem reasonable. Which does seem like a special case of the following sentence where you suggest using the original loss function but substituting the unablated model for the supervision targets—that also seems like a good summary statistic to look at.

Comment by davidad on Practical Pitfalls of Causal Scrubbing · 2023-03-27T21:48:04.645Z · LW · GW

As an alternative summary statistic of the extent to which the ablated model performs worse on average, I would suggest the Bayesian Wilcoxon signed-rank test.

Comment by davidad on Behavioral and mechanistic definitions (often confuse AI alignment discussions) · 2023-02-21T04:59:31.410Z · LW · GW

In computer science this distinction is often made between extensional (behavioral) and intensional (mechanistic) properties (example paper).
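
A small illustration of the distinction, using two hypothetical functions: they are extensionally equal (same input/output mapping on the tested range) but intensionally different (one loops, one uses a closed form).

```python
def sum_to_n_loop(n):
    """Add the integers 0..n one at a time."""
    total = 0
    for i in range(n + 1):
        total += i
    return total

def sum_to_n_formula(n):
    """Use the closed form n(n+1)/2."""
    return n * (n + 1) // 2

# Behavioural (extensional) check: compare outputs only, never the mechanism.
extensionally_equal = all(
    sum_to_n_loop(n) == sum_to_n_formula(n) for n in range(100)
)
```

A property such as "returns 5050 on input 100" is extensional and holds for both. A property such as "contains a loop" is intensional and holds for only one of them.
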

Comment by davidad on Assigning Praise and Blame: Decoupling Epistemology and Decision Theory · 2023-01-27T19:43:28.933Z · LW · GW

For the record, the canonical solution to the object-level problem here is Shapley Value. I don’t disagree with the meta-level point, though: a calculation of Shapley Value must begin with a causal model that can predict outcomes with any subset of contributors removed.
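
A minimal sketch of the calculation, averaging each contributor's marginal contribution over all orderings. The `outcome` function plays the role of the causal model that predicts results with any subset of contributors removed. The contributors and the success rule are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average marginal contribution of each player over all join orders."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def outcome(coalition):
    """Hypothetical causal model: success (1.0) needs both A and B; C adds nothing."""
    return 1.0 if {"A", "B"} <= coalition else 0.0

credit = shapley_values(["A", "B", "C"], outcome)
```

In this example A and B each receive 0.5 and C receives 0.0. Enumerating all orderings is exponential in the number of contributors, so this form is only practical for small groups.
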

Comment by davidad on The Alignment Problem from a Deep Learning Perspective (major rewrite) · 2023-01-17T23:30:49.371Z · LW · GW

I think there’s something a little bit deeply confused about the core idea of “internal representation” and that it’s also not that hard to fix.

  1. I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally represented goals”, calling the other one safer—or different at all, from a safety perspective—would be a mistake. (And, in the long run, a mistake that opens Goodhartish loopholes for models to avoid having the internal properties we don’t like without actually being safer.)

  2. The fix is, roughly, to identify the “internal representations” in any suitable causal explanation of the extensionally observable behavior, including but not limited to a causal explanation whose variables correspond naturally to the “actual” computational graph that implements the policy/model.

  3. If we can identify internally represented [goals, etc] in the actual computational graph, that is of course strong evidence of (and in the limit of certainty and non-approximation, logically implies) internally represented goals in some suitable causal explanation. But the converse and the inverse of that implication would not always hold.

  4. I believe many non-safety ML people have a suspicion that safety people are making a vaguely superstitious error by assigning such significance to “internal representations”. I don’t think they are right exactly, but I think this is the steelman, and that if you can adjust to this perspective, it will make those kinds of critics take a second look. Extensional properties that scientists give causal interpretations to are far more intuitively “real” to people with a culturally stats-y background than supposedly internal/intensional properties.

Sorry these comments come on the last day; I wish I had read it in more depth a few days ago.

Overall the paper is good and I’m glad you’re doing it!

Comment by davidad on World-Model Interpretability Is All We Need · 2023-01-17T18:20:16.917Z · LW · GW

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.

Comment by davidad on World-Model Interpretability Is All We Need · 2023-01-17T18:16:32.360Z · LW · GW

In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those points didn’t get enough emphasis and people focused more on route 1 there, which seems much more hopeless.

Comment by davidad on Categorizing failures as “outer” or “inner” misalignment is often confused · 2023-01-07T04:40:54.329Z · LW · GW

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

  • you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
  • your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this policy-scoring function was incorrect, but adequately optimized for.

I tend to favor the latter interpretation—I'd say the policy-scoring function in both scenarios was ill-defined, and therefore both scenarios are more a Reward Specification (roughly outer alignment) problem. Only when you do have "programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon" is the decomposition into Reward Specification and Adequate Policy Learning really useful.

Comment by davidad on Side-channels: input versus output · 2022-12-30T02:07:59.188Z · LW · GW

I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.

Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.
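As a rough sketch of what "filtering out" NaN payloads could look like, the following assumes IEEE 754 binary64 values and works directly on their bit patterns. The function names and the choice of canonical NaN are my own assumptions for illustration, not something specified in the post. Every NaN, whatever its sign or payload, is mapped to a single canonical quiet NaN, so the payload bits can no longer carry information; subnormals are detected but left untouched.

```python
import struct
import sys

# Assumed canonical form: quiet NaN with sign bit 0 and an all-zero payload.
CANONICAL_QNAN = 0x7FF8000000000000
EXP_MASK = 0x7FF0000000000000   # the 11 exponent bits of a binary64 value
FRAC_MASK = 0x000FFFFFFFFFFFFF  # the 52 fraction (significand) bits

def float_to_bits(x: float) -> int:
    """Reinterpret a Python float (binary64) as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def is_nan_bits(bits: int) -> bool:
    """A NaN has all exponent bits set and a nonzero fraction field."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0

def canonicalize_bits(bits: int) -> int:
    """Replace any NaN bit pattern with the canonical one; leave others as-is."""
    return CANONICAL_QNAN if is_nan_bits(bits) else bits

def is_subnormal(x: float) -> bool:
    """Nonzero values smaller in magnitude than the smallest normal float."""
    return x != 0.0 and abs(x) < sys.float_info.min

if __name__ == "__main__":
    noisy = 0x7FF8000000001234  # quiet NaN carrying a nonzero payload
    print(hex(canonicalize_bits(noisy)))
    print(is_subnormal(5e-324), is_subnormal(1.0))
```

Working on integer bit patterns rather than on floats avoids depending on whether a given platform preserves payloads when NaNs pass through arithmetic or conversions.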

Comment by davidad on Side-channels: input versus output · 2022-12-28T17:03:35.399Z · LW · GW

I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close to or far from zero).

If we were still using 10-digit decimal words like the original ENIAC and other early computers, I'd be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.

Comment by davidad on An Open Agency Architecture for Safe Transformative AI · 2022-12-23T22:36:14.490Z · LW · GW

The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property (presumably a conjunction of many other properties, etc.) that conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to resources sufficient for at least their previous standard of living). In human history, the development of increasingly formalized bright lines around what things count as definitely bad things (namely, laws) seems to have been greatly instrumental in the reduction of bad things overall.

Regarding the challenges of understanding formal descriptions, I’m hopeful about this because of

  • natural abstractions (so the best formal representations could be shockingly compact)
  • code review (Google’s codebase is not exactly “a formal property,” unless we play semantics games, but it is highly reliable, fully machine-readable, and every one of its several billion lines of code has been reviewed by at least 3 humans)
  • AI assistants (although we need to be very careful here; e.g. reading LLM outputs cannot substitute for actually understanding the formal representation since they are often untruthful)

Comment by davidad on An Open Agency Architecture for Safe Transformative AI · 2022-12-23T22:28:17.168Z · LW · GW

Shouldn't we plan to build trust in AIs in ways that don't require humans to do things like vet all changes to its world-model?

Yes, I agree that we should plan toward a way to trust AIs as something more like virtuous moral agents rather than as safety-critical systems. I would prefer that. But I am afraid those plans will not reach success before AGI gets built anyway, unless we have a concurrent plan to build an anti-AGI defensive TAI that requires less deep insight into normative alignment.

Comment by davidad on An Open Agency Architecture for Safe Transformative AI · 2022-12-23T22:23:52.548Z · LW · GW

In response to your linked post, I do have similar intuitions about “Microscope AI” as it is typically conceived (i.e. to examine the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but in my view both avoid the core problem you’re pointing at (i.e. a useful neural network will always be larger than your understanding of it, and that matters):

  1. Model-checking policies for formal properties. A model-checker (unlike a human interpreter) works with the entire network, not just the most interpretable parts. If it proves a property, that property is true about the actual neural network. The Model-Checking Feasibility Hypothesis says that this is feasible, regardless of the infeasibility of a human understanding the policy or any details of the proof. (We would rely on a verified verifier for the proof, of which humans would understand the details.)
  2. Factoring learned information through human understanding. If we denote learning by $L$, human understanding by $H$, and big effects on the world by $W$, then “factoring” means that the map $L \to W$ decomposes as $L \xrightarrow{f} H \xrightarrow{g} W$ (for some $f$ and $g$). This is in the same spirit as “human in the loop,” except not for the innermost loops of real-time action. Here, the Scientific Sufficiency Hypothesis implies that even though $L$ is “larger” than $H$ in the sense you point out, we can throw away the parts that don’t fit in $H$ and move forward with a fully-understood world model. I believe this is likely feasible for world models, but not for policies (optimal policies for simple world models, like Go, can of course be much better than anything humans understand).
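To give a flavour of item 1, here is a minimal sketch of one simple way to prove a property about an entire network rather than its most interpretable parts: interval bound propagation over a tiny ReLU network. This is my own illustrative choice of technique, not the specific model-checker the proposal has in mind, and the weights, input box, and property below are hypothetical. It also ignores floating-point rounding, which a real verified verifier would have to account for.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]                # (lower, upper) bound on one value
Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias vector)

def affine_bounds(W: Sequence[Sequence[float]], b: Sequence[float],
                  box: Sequence[Interval]) -> List[Interval]:
    """Bounds on W @ x + b for all x inside the input box."""
    out = []
    for row, bias in zip(W, b):
        lo = hi = bias
        for w, (l, u) in zip(row, box):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:  # a negative weight swaps which endpoint gives the extreme
                lo += w * u
                hi += w * l
        out.append((lo, hi))
    return out

def relu_bounds(box: Sequence[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each endpoint."""
    return [(max(0.0, l), max(0.0, u)) for l, u in box]

def output_bounds(layers: Sequence[Layer], box: Sequence[Interval]) -> List[Interval]:
    """Propagate the box through every layer; ReLU between layers only."""
    for i, (W, b) in enumerate(layers):
        box = affine_bounds(W, b, box)
        if i < len(layers) - 1:
            box = relu_bounds(box)
    return list(box)

def proves_upper_bound(layers: Sequence[Layer], box: Sequence[Interval],
                       limit: float) -> bool:
    """True means proved for every input in the box; False means 'unknown'."""
    return all(hi <= limit for _, hi in output_bounds(layers, box))

# Hypothetical 2-2-1 network and input region.
LAYERS: List[Layer] = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
                       ([[1.0, 1.0]], [0.0])]
BOX: List[Interval] = [(-1.0, 1.0), (-1.0, 1.0)]

if __name__ == "__main__":
    print(output_bounds(LAYERS, BOX))
    print(proves_upper_bound(LAYERS, BOX, 3.0))
```

The bounds are sound but can be loose, so a `False` result does not show the property is violated, only that this method failed to prove it. The point of the sketch is that a `True` result holds for the actual network on every input in the box, without anyone needing to understand what the weights mean.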