ELK First Round Contest Winners 2022-01-26T02:56:56.089Z
Apply for research internships at ARC! 2022-01-03T20:26:18.269Z
Prizes for ELK proposals 2022-01-03T20:23:25.867Z
Counterexamples to some ELK proposals 2021-12-31T17:05:10.515Z
ARC's first technical report: Eliciting Latent Knowledge 2021-12-14T20:09:50.209Z
ARC is hiring! 2021-12-14T20:09:33.977Z
Why I'm excited about Redwood Research's current project 2021-11-12T19:26:26.159Z
Comments on OpenPhil's Interpretability RFP 2021-11-05T22:36:04.733Z
EDT with updating double counts 2021-10-12T04:40:02.158Z
Secure homes for digital people 2021-10-10T15:50:02.697Z
Improving capital gains taxes 2021-07-09T05:20:05.294Z
How much chess engine progress is about adapting to bigger computers? 2021-07-07T22:35:29.245Z
Experimentally evaluating whether honesty generalizes 2021-07-01T17:47:57.847Z
paulfchristiano's Shortform 2021-06-29T01:33:14.099Z
Avoiding the instrumental policy by hiding information about humans 2021-06-13T20:00:51.597Z
Answering questions honestly given world-model mismatches 2021-06-13T18:00:08.396Z
A naive alignment strategy and optimism about generalization 2021-06-10T00:10:02.184Z
Teaching ML to answer questions honestly instead of predicting human answers 2021-05-28T17:30:03.304Z
Decoupling deliberation from competition 2021-05-25T18:50:03.879Z
Mundane solutions to exotic problems 2021-05-04T18:20:05.331Z
Low-stakes alignment 2021-04-30T00:10:06.163Z
AMA: Paul Christiano, alignment researcher 2021-04-28T18:55:39.707Z
Announcing the Alignment Research Center 2021-04-26T23:30:02.685Z
Another (outer) alignment failure story 2021-04-07T20:12:32.043Z
My research methodology 2021-03-22T21:20:07.046Z
Demand offsetting 2021-03-21T18:20:05.090Z
It’s not economically inefficient for a UBI to reduce recipient’s employment 2020-11-22T16:40:05.531Z
Hiring engineers and researchers to help align GPT-3 2020-10-01T18:54:23.551Z
“Unsupervised” translation as an (intent) alignment problem 2020-09-30T00:50:06.077Z
Distributed public goods provision 2020-09-26T21:20:05.352Z
Better priors as a safety problem 2020-07-05T21:20:02.851Z
Learning the prior 2020-07-05T21:00:01.192Z
Inaccessible information 2020-06-03T05:10:02.844Z
Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z
Hedonic asymmetries 2020-01-26T02:10:01.323Z
Moral public goods 2020-01-26T00:10:01.803Z
Of arguments and wagers 2020-01-10T22:20:02.213Z
Prediction markets for internet points? 2019-10-27T19:30:00.898Z
AI alignment landscape 2019-10-13T02:10:01.135Z
Taxing investment income is complicated 2019-09-22T01:30:01.242Z
The strategy-stealing assumption 2019-09-16T15:23:25.339Z
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z
What failure looks like 2019-03-17T20:18:59.800Z
Security amplification 2019-02-06T17:28:19.995Z
Reliability amplification 2019-01-31T21:12:18.591Z
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z
Thoughts on reward engineering 2019-01-24T20:15:05.251Z
Learning with catastrophes 2019-01-23T03:01:26.397Z


Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-25T17:18:12.956Z · LW · GW
  1. "Bad reporter" = any reporter that gives unambiguously bad answers in some situations (in the ontology identification case, basically anything other than a direct translator)
  2. "use knowledge of direct translation" = it may be hard to learn direct translation because you need a bunch of parameters to specify how to do it, but these "bad" reporters may also need the same bunch of parameters (because they do direct translation in some situations)
  3. In the "upstream" counterexample, the bad reporter does direct translation under many circumstances but then sometimes uses a different heuristic that generates a bad answer. So the model needs all the same parameters used for direct translation, as mentioned in the last point. (I think your understanding of this was roughly right.)
  4. More like: now we've learned a reporter which contains what we want and also some bad stuff, you could imagine doing something like imitative generalization (or e.g. a different regularization scheme that jointly learned multiple reporters) in order to get just what we wanted.
Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-25T03:40:22.257Z · LW · GW

I'd like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this:

  • Consistency checks will behave differently in W1 and W2. Even if a human can never produce different answers to Q1 and Q2, they can talk about situations where Q1 and Q2 differ and describe how the answers to those questions relate to all the other facts about the world (and to the answer to Q).
  • If language is rich enough, and we are precise enough with the formulation of questions, then you may hope that lots of other questions have different interpretations in W1 and W2, i.e. such that the simplest way of answering other questions will generalize correctly to Q.
  • In the case of amplification/debate, Q2 = "Does a human with AI assistants believe a diamond is in the room?" and so we can hope that in fact Q1 and Q2 have the same answers in all situations. (Though we aren't optimistic about this.)
Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-23T21:35:29.643Z · LW · GW

In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don't do that is because it's likely to be too big to make much sense of.

We don't quite have access to the AI Bayes net---we just have a big neural network, and we sometimes talk about examples where what the neural net is doing internally can be well-described as "inference in a Bayes net."

So ideally a solution would use neither the human Bayes net nor the AI Bayes net.

But when thinking about existing counterexamples, it can still be useful to talk about how we want an algorithm to behave in the case where the human/AI are using a Bayes net, and we do often think about ideas that use those Bayes nets (with the understanding that we'd ultimately need to refine them into approaches that don't depend on having an explicit Bayes net).

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-21T05:41:33.950Z · LW · GW

We're going to accept submissions through February 10.

(We actually ended up receiving more submissions than I expected but it seems valuable, and Mark has been handling all the reviews, so running for another 20 days seems worthwhile.)

Comment by paulfchristiano on Alex Ray's Shortform · 2022-01-19T19:38:14.269Z · LW · GW

"The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly?

My general goal, Redwood's current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally.

I'm curious how this is connected to "doesn't write fiction where a human is harmed".

"Don't produce outputs where someone is injured" is just an arbitrary thing not to do. It's chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it's plausible that a predicate with neutral valence would have been better to avoid confusion).

Comment by paulfchristiano on Alex Ray's Shortform · 2022-01-19T16:56:23.833Z · LW · GW

The goal is not to remove concepts or change what the model is capable of thinking about, it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-17T05:07:14.903Z · LW · GW

I'd be fine with a proposal that flips coins and fails with small probability (in every possible world).

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-17T05:05:23.614Z · LW · GW

Yes, thanks!

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2022-01-11T16:33:31.464Z · LW · GW

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2022-01-10T21:16:54.631Z · LW · GW

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that the benign hypothesis description is within 30 bits of the good hypothesis), and embeddedness alone isn't enough to get you there.
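
As a rough check on the arithmetic above, here is a minimal sketch. It assumes a hypothesis's prior weight scales as 2^(-description length), that there is a single malign competitor, and that both hypotheses fit the observed data equally well; the function name is made up for illustration.

```python
def benign_posterior(extra_bits: float) -> float:
    """Posterior mass of a benign hypothesis that costs `extra_bits` more
    bits to describe than one malign competitor, assuming equal fit to data."""
    odds = 2.0 ** (-extra_bits)  # benign : malign odds under a 2^-length prior
    return odds / (1.0 + odds)


if __name__ == "__main__":
    for bits in (10, 20, 30):
        p = benign_posterior(bits)
        # At 30 bits the benign mass is about 2^-30, i.e. roughly one in a billion.
        print(f"{bits} extra bits: benign mass {p:.3e}, malign mass {1 - p:.10%}")
```

Under these assumptions a 30-bit gap leaves the malign hypothesis with all but about one part in a billion of the probability, which is where the 99.9999999% figure comes from.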

I'm not sure I understand what you mean by "decision-theoretic approach"

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform Solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of Solomonoff induction applied to your experiences, e.g. by learning a human, then it seems again vulnerable to attack, bridging hypotheses or no).
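
A toy sketch of "choose actions based on E[utility|action]", assuming a finite set of hypotheses with fixed weights and a utility that depends only on the hypothesis and the action (not on the agent's experiences). All names and numbers here are hypothetical and only illustrate the shape of the computation.

```python
from typing import Callable, Dict, Sequence


def choose_action(
    actions: Sequence[str],
    weights: Dict[str, float],
    utility: Callable[[str, str], float],
) -> str:
    """Return the action with the highest weighted-average utility."""
    total = sum(weights.values())

    def expected_utility(action: str) -> float:
        # E[utility | action] = sum over hypotheses of P(h) * U(h, action)
        return sum(w * utility(h, action) for h, w in weights.items()) / total

    return max(actions, key=expected_utility)


# Hypothetical toy numbers: two hypotheses, two actions.
WEIGHTS = {"h1": 0.75, "h2": 0.25}
PAYOFFS = {("h1", "a"): 1.0, ("h1", "b"): 0.0, ("h2", "a"): 0.0, ("h2", "b"): 4.0}


def toy_utility(hypothesis: str, action: str) -> float:
    return PAYOFFS[(hypothesis, action)]


if __name__ == "__main__":
    print(choose_action(["a", "b"], WEIGHTS, toy_utility))
```

The induction step only shows up here as the source of the weights; the thing being optimized is the utility, not predictive accuracy about the agent's own observations.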

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

I agree that the situation is better when Solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2022-01-10T16:33:59.885Z · LW · GW

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

I agree that removing bridge hypotheses removes one of the advantages for malign hypotheses. I didn't mention this because it doesn't seem like the way in which John is using "embeddedness"; for example, it seems orthogonal to the way in which the situation violates the conditions for Solomonoff induction to be eventually correct. I'd stand by saying that it doesn't appear to make the problem go away.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses (since otherwise they also get big benefits from the influence update). And then once you've done that in a sensible way it seems like it also addresses any issues with embeddedness (though maybe we just want to say that those are being solved inside the decision theory). If you want to recover the expected behavior of induction as a component of intelligent reasoning (rather than a component of the utility function + an instrumental step in intelligent reasoning) then it seems like you need a rather different tack.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources. If you do the same induction but just remove the malign hypotheses, then it seems like you are even dumber and the problem is even worse viewed from the competitiveness perspective.

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-10T16:17:03.943Z · LW · GW

My guess is that "help humans improve their understanding" doesn't work anyway, at least not without a lot of work, but it's less obvious and the counterexamples get weirder.

It's less clear whether ELK is a less natural subproblem for the unlimited version of the problem. That is, if you try to rely on something like "human deliberation scaled up" to solve ELK, you probably just have to solve the whole (unlimited) problem along the way.

It seems to me like the core troubles with this point are:

  • You still have finite training data, and we don't have a scheme for collecting it. This can result in inner alignment problems (and it's not clear those can be distinguished from other problems, e.g. you can't avoid them with a low-stakes assumption).
  • It's not clear that HCH ever figures out all the science, no matter how much time the humans spend (and having a guarantee that you eventually figure everything out seems kind of close to ELK, where the "have AI help humans improve our understanding" is to some extent just punting to the humans+AI to figure out something).
  • Even if HCH were to work well it will probably be overtaken by internal consequentialists, and I'm not sure how to address that without competitiveness. (Though you may need a weaker form of competitiveness.)

I'm generally interested in crisper counterexamples since those are a bit of a mess.

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-10T04:57:31.493Z · LW · GW

I think this is an important family of counterexamples not really addressed in our report (and which we're not really asking contest participants to handle, i.e. if you come up with a proposal for which this is the only counterexample we could come up with then we'd definitely give a $50k prize).

Some thoughts:

  • As you say, a lot of this comes down to our hopes for handling learned optimizers, which we discuss in this appendix.
  • Most importantly, we can use SGD to optimize "What is the probability I assign to answers that are consistent with other copies' answers?" instead of "What score do I get in the consistency game?". If this was the only difficulty, we'd want to be really careful about what SGD optimized. That means that we are basically left with inner alignment problems that could kill us in much more direct ways (e.g. by making the low-stakes assumption that we can recover from any finite number of failures). What I'm mostly interested in is trying to get enough of a handle on ELK that inner alignment becomes the dominant concern / we know enough about ELK to be able to work productively on inner alignment.
  • If your model is doing optimization in order to break out of the sandbox, searching over cognitive actions to find those that break out of the sandbox, what we really want to do is apply ELK at an earlier stage to the model that was evaluating cognitive actions. This is a pretty messy situation, but it looks promising enough that I'm inclined to focus on other problems first and then see how this counterexample looks later. (Similarly, you'll always have similar pathologies with gradient hacking etc. and so you need to use a messier mechanism like this to handle those in the worst case.)
  • On the other hand, if your model isn't doing any optimization to find ways to break out of the sandbox, then it seems like you are up against a fairly weak adversary (just SGD). In the worst case this is still bad news, but it's not merely better in practice, it also seems like it gives you a large leg up with which to beat the model. In general, defending against a SmartVault which controls a ton of actuators seems harder than defending against an adversary who can only use side channels.
  • I'm normally pretty opposed to sandboxing, but I do also think that if this was all you had to do, then it is actually plausible that you could literally win the game. E.g. if the overhead on homomorphic encryption was low enough (say 1000%) that you could afford to run each reporter training datapoint homomorphically encrypted, then it seems like you could win against any adversary. I still wouldn't bet on it without thinking a lot, but I think it makes the point that there is a big quantitative gap here.

Overall I agree that in the worst case this kind of counterexample applies to all of the proposals in this document. Moreover, I agree that unfortunately it is (i) a complicated aesthetic judgment about what kind of counterexample to focus on first / what is a prerequisite to what, and (ii) not obvious that you can define the problem narrowly enough to avoid this counterexample without also defining-away other difficulties we are trying to address. But I'm still willing to bet on the aesthetic judgment and it does look like we can avoid this counterexample without accidentally defining away the whole problem.

(I don't think this comment covers the entire issue, I hope we'll write more about this in another report.)

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-09T19:35:24.264Z · LW · GW

As described in the report it would say "I'm not sure" when the human wasn't sure (unless you penalized that).

That said, often a human who looks at a sequence of actions would say "almost certainly the diamond is there." They might change their answer if you also told them "by the way these actions came from a powerful adversary trying to get you to think the diamond is there." What exactly the reporter says will depend on some details of e.g. how the reporter reasons about provenance.

But the main point is that in no case do you get useful information about examples that a human (with AI assistants) couldn't figure out what was happening on their own.

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-09T17:04:41.177Z · LW · GW

I haven't written any such articles. I definitely think it's promising.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-09T17:03:30.804Z · LW · GW

In all of the counterexamples the reporter starts from the , actions, and  predicted by the predictor. In order to answer questions it needs to infer the latent variables in the human's model.

Originally we described a counterexample where it copied the human inference process.

The improved counterexample is to instead use lots of computation to do the best inference it can, rather than copying the human's mediocre inference. To make the counterexample fully precise we'd need to specify an inference algorithm and other details.

We still can't do perfect inference though---there are some inference problems that just aren't computationally feasible.

(That means there's hope for creating data where the new human simulator does badly because of inference mistakes. And maybe if you are careful it will also be the case that the direct translator does better, because it effectively reuses the inference work done in the predictor? To get a proposal along these lines we'd need to describe a way to produce data that involves arbitrarily hard inference problems.)

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-07T17:08:33.533Z · LW · GW

In some sense, ELK as a problem only even starts "applying" to pretty smart models (ones who can talk including about counterfactuals / hypotheticals, as discussed in this appendix.) This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.

I think that it's more complicated to talk about what models "really know" as they get dumber, so we want to use very smart models to construct unambiguous counterexamples. I do think that the spirit of the problem applies even to very tiny models, and those are likely interesting.

(More precisely: it's always extremely subtle to talk about what models "know," but as models get smarter there are many more things that they definitely know so it's easier to notice if you are definitely failing. And the ELK problem statement in this doc is really focused on this kind of unambiguous failure, mostly as a methodological point but also partly because the cases where AI murders you also seem to involve "definitely knowing" in the same sense.)

I think my take is that for linear/logistic regression there is no latent knowledge, but even for a fully linear 3 layer neural network, or a 2 layer network solving many related problems, there is latent knowledge and an important conceptual question about what it means to "know what they know."
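
To make the "fully linear 3 layer network" case concrete, here is a minimal sketch with made-up weights and helper names. Composing linear layers gives the same input-output map as a single collapsed matrix, yet the layered version still computes intermediate activations that the collapsed map never exposes. Whether those intermediates count as "latent knowledge" is the conceptual question above; this toy only shows that the two descriptions differ internally while agreeing on outputs.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical weights for a 3 layer network with no nonlinearities.
W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [1.0, 1.0]]
W3 = [[1.0, -1.0]]


def forward(x):
    """Run the layered network, returning hidden activations and the output."""
    h1 = matvec(W1, x)
    h2 = matvec(W2, h1)
    y = matvec(W3, h2)
    return h1, h2, y


# The same function written as a single linear map, with no hidden layers.
COLLAPSED = matmul(W3, matmul(W2, W1))

if __name__ == "__main__":
    x = [1.0, 1.0]
    h1, h2, y = forward(x)
    print("hidden activations:", h1, h2)
    print("layered output:", y, "collapsed output:", matvec(COLLAPSED, x))
```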

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-07T06:28:07.350Z · LW · GW

The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: "The AI model gets a higher score to the degree that people doing 'digital neuroscience' would have an easier time, and find more interesting things, probing its 'digital brain.'" So an AI with neurons corresponding to diamonds, robbers, sensors, etc. would outscore an AI whose neurons can't easily be seen to correspond to any human-familiar concepts.

I think that a lot depends on what kind of term you include.

If you just say "find more interesting things" then the model will just have a bunch of neurons designed to look interesting. Presumably you want them to be connected in some way to the computation, but we don't really have any candidates for defining that in a way that does what you want.

In some sense I think if the digital neuroscientists are good enough at their job / have a good enough set of definitions, then this proposal might work. But I think that the magic is mostly being done in the step where we make a lot of interpretability progress, and so if we define a concrete version of interpretability right now it will be easy to construct counterexamples (even if we define it in terms of human judgments). If we are just relying on the digital neuroscientists to think of something clever, the counterexample will involve something like "they don't think of anything clever." In general I'd be happy to talk about concrete proposals along these lines.

(I agree with Ajeya and Mark that the hard case for this kind of method is when the most efficient way of thinking is totally alien to the human. I think that can happen, and in that case in order to be competitive you basically just need to learn an "interpreted" version of the alien model. That is, you need to basically show that if there exists an alien model with performance X, there is a human-comprehensible model with performance X, and the only way you'll be able to argue that is by showing that for any model we can define a human-comprehensible model with similar complexity and the same behavior.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-04T15:58:50.182Z · LW · GW

Generally we are asking for an AI that doesn't give an unambiguously bad answer, and if there's any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn't unambiguously bad and we're fine if the AI gives it.

There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it's catastrophic for our AI not to make the "correct" judgment. I'm not sure what kind of example you have in mind and I have somewhat different responses to different kinds of examples.

For example, note that ELK is never trying to answer any questions of the form "how good is this outcome?"; I certainly agree that there can also be ambiguity about questions like "did the diamond stay in the room?" but it's a fairly different situation. The most relevant sections are narrow elicitation and why it might be sufficient which gives a lot of examples of where we think we can/can't tolerate ambiguity, and to a lesser extent avoiding subtle manipulation which explains how you might get a good outcome despite tolerating such ambiguity. That said, there are still lots of reasonable objections to both of those.

Comment by paulfchristiano on Counterexamples to some ELK proposals · 2022-01-03T02:59:44.690Z · LW · GW

In some sense this is exactly what we want to do, and this is why we are happy with a very "narrow" version of ELK (see the appendices on narrow elicitation and why it might be enough, indirect normativity, and avoiding subtle manipulation).

But you still need to care about some sensor tampering. In particular, you need to make sure that there are actually happy humans deliberating about what to do (under local conditions that they believe are conducive to figuring out the answer), rather than merely cameras showing happy humans deliberating about what to do.

Comment by paulfchristiano on Counterexamples to some ELK proposals · 2022-01-02T17:54:04.900Z · LW · GW

If we train on data about what hypothetical sensors should show (e.g. by experiments where we estimate what they would show using other means, or by actually building weird sensors), we could just end up getting predictions of whatever process we used to generate that data.

In general the overall situation with these sensors seems quite similar to the original outer-level problem, i.e. training the system to answer "what would an ideal sensor show?" seems to run into the same issues as answering "what's actually going on?" E.g. your supersensor idea #3 seems to be similar to the "human operates SmartVault and knows if tampering occurred" proposal we discussed here.

I do think that excising knowledge is a substantive change, I feel like it's effectively banking on "if the model is ignorant enough about what humans are capable of, it needs to err on the side of assuming they know everything." But for intelligent models, it seems hard in general to excise knowledge of whole kinds of sensors (how do you know a lot about human civilization without knowing that it's possible to build a microphone?) without interfering with performance. And there are enough signatures that the excised knowledge is still not in-distribution with hypotheticals we make up (e.g. the possibility of microphones is consistent with everything else I know about human civilization and physics, the possibility of invisible and untouchable cameras isn't) and conservative bounds on what humans can know will still hit the one but not the other.

Comment by paulfchristiano on Counterexamples to some ELK proposals · 2022-01-02T16:29:33.287Z · LW · GW

I would describe the overall question as "Is there a situation where an AI trained using this approach deliberately murders us?" and for ELK more specifically as "Is there a situation where an AI trained using this approach gives an unambiguously wrong answer to a straightforward question despite knowing better?" I generally don't think that much about the complexity of human values.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-01T23:29:04.494Z · LW · GW

Thanks for the kind words (and proposal)!

There was recently a discussion on LW about a scenario similar to the SmartVault one here. My proposed solution was to use reward uncertainty -- as applied to the SmartVault scenario, this might look like: "train lots of diverse mappings between the AI's ontology and that of the human; if even one mapping of a situation says the diamond is gone according to the human's ontology, try to figure out what's going on". IMO this general sort of approach is quite promising, interested to discuss more if people have thoughts

I broadly agree with "train a bunch of models and panic if any of them say something is wrong." The main catch is that this only works if none of the models are optimized to say something scary, or to say something different for the sake of being different. We discuss this a bit in this appendix.
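
A minimal sketch of the "panic if any model says something is wrong" rule and of the catch described above. The reporter functions and situation strings are hypothetical stand-ins, not anything from the report: each reporter maps a situation to True ("the diamond is still there") or False.

```python
from typing import Callable, Sequence

Reporter = Callable[[str], bool]  # True means "the diamond is still there"


def should_panic(reporters: Sequence[Reporter], situation: str) -> bool:
    """Raise the alarm if even one reporter says the diamond is gone."""
    return any(not reporter(situation) for reporter in reporters)


def honest(situation: str) -> bool:
    # Stand-in for a reporter that tracks what actually happened.
    return not situation.startswith("diamond stolen")


def human_simulator(situation: str) -> bool:
    # Stand-in for a reporter that says whatever the sensors make it look like.
    return True


def contrarian(situation: str) -> bool:
    # Stand-in for a reporter optimized to say something scary or different.
    return False


if __name__ == "__main__":
    for situation in ("diamond safe", "diamond stolen, sensors tampered"):
        print(situation, should_panic([honest, human_simulator], situation))
    # With a contrarian in the mix the alarm fires even when nothing is wrong.
    print(should_panic([honest, human_simulator, contrarian], "diamond safe"))
```

With only the first two reporters the alarm is informative; once a reporter like `contrarian` is included it fires on every input and so carries no information.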

Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via gradient descent. From what I could tell after some quick Googling, it doesn't seem all that common to do this, and it's not clear to me if there has been any work at all on learning the node structure (as opposed to internal node parameters) via gradient descent. It seems like this could be tricky because the node structure is combinatorial in nature and thus less amenable to a continuous optimization technique like gradient descent.

We're imagining the case where the predictor internally performs inference in a learned model, i.e. we're not explicitly learning a Bayesian network but merely considering possibilities for what an opaque neural net is actually doing (or approximating) on the inside. I don't think this is a particularly realistic possibility, but if ELK fails in this kind of simple case it seems likely to fail in messier realistic cases.

I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.

(We're actually planning to do a narrower contest focused on ELK proposals.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-01T23:04:16.757Z · LW · GW

I don't think we have any kind of precise definition of "no ambiguity." That said, I think it's easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren't present in a given situation.

In general I feel a lot better about our definitions when we are using them to arbitrate a counterexample than if we were trying to give a formal definition. If all the counterexamples involved border cases of the concepts, where there was arguable ambiguity about whether the diamond really stayed in the room, then it would seem important to firm up these concepts but right now it feels like it is easy to just focus on cases where algorithms unambiguously fail.

(That methodological point isn't obvious though---it may be that precise definitions are very useful for solving the problem even if you don't need them to judge current solutions as inadequate. Or it may be that actually existing counterexamples are problematic in ways we don't recognize. Pushback on these fronts is always welcome, but right now I feel pretty comfortable with the situation.)

Comment by paulfchristiano on Eliciting Latent Knowledge Via Hypothetical Sensors · 2021-12-31T20:48:41.765Z · LW · GW

I think this is a good approach to consider, though I'm currently skeptical this kind of thing can resolve the worst case problem.

My main concern is that models won't behave well by default when we give them hypothetical sensors that they know don't exist (especially relevant for idea #3). But on the other hand, if we need to get good signal from the sensors that actually exist then it seems like we are back in the "build a bunch of sensors and hope it's hard to tamper with them all" regime. I wrote up more detailed thoughts in Counterexamples to some ELK proposals.

Other random thoughts:

  • Agreed you shouldn't just use cameras and should include all kinds of sensors (basically everything that humans can understand, including with the help of powerful AI assistants).
  • I basically think that "facts of the matter" are all we need, and if we have truthfulness about them then we are in business (because we can defer to future humans who are better equipped to evaluate hard moral claims).
  • I think "pivotal new sensors" is very similar to the core idea in Ramana and Davidad's proposals, so I addressed them all as a group in Counterexamples to some ELK proposals.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-30T21:17:04.621Z · LW · GW

I think AZELK is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback). The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn't be able to evaluate that answer.

Some thoughts on this setup:

  • I'm very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be great to (i) check whether that actually happens and (ii) have an empirical model of a hard situation so that we can do applied research rather than just theory.
  • There is some subtlety where AZ invokes the policy/value a bunch of times in order to make a single move. I don't think this is a fundamental complication, so from here on out I'll just talk about ELK for a single value function invocation. I don't think the problem is very interesting unless the AZ value function itself is much stronger than your humans.
  • Many questions about Go can be easily answered with a lot of compute, and for many of these questions there is a plausible straightforward approach based on debate/amplification. I think this is also interesting to do experiments with, but I'm most worried about the cases where this is not possible (e.g. the ontology identification case, which probably arises in Go but is a bit more subtle).
  • If a human doesn't know anything about Go, then AZ may simply not have any latent knowledge that is meaningful to them. In that case we aren't expecting/requiring ELK to do anything at all. So we'd like to focus on cases where the human does understand concepts that they can ask hard questions about. (And ideally they'd have a rich web of concepts so that the question feels analogous to the real world case, but I think it's interesting as long as they have anything.) We never expect it to walk us through pedagogy, and we're trying to define a utility function that also doesn't require pedagogy in the real world, i.e. that is defined in terms of familiar concepts. I think it would make sense to study how to get AZ to explain basic principles of Go to someone who lacks any relevant concepts, but I don't think it would be analogous to ELK in particular.
  • I think it's important that you have access to human explanations, or answers to questions, or discussions about what concepts mean. This is the only way you're anchoring the meaning of terms, and generally important for most of the approaches. This is a central part of why we're only aiming at training the system to talk about human concepts.
  • I think it's important that AZELK is trained by humans who are much worse at Go than AZ. Otherwise it doesn't seem helpfully analogous to long-run problems. And I don't see much upside to doing such a project with experts rather than amateurs. I think that most likely you'd want to do it with Go amateurs (e.g. 10k or even weaker). It's possible that you need fairly weak humans before AZ actually has intuitions that the human couldn't arbitrate a debate about, but that would already be interesting to learn and so I wouldn't stress about it at first. (I would also consider debate and amplification as "in bounds" until we could find some hard case where they failed; initial steps might not be analogous to the hardest parts of ELK but that's fine.)
  • I don't expect AZELK to ever talk about why it chose a move or "what it's thinking" or so on---just to explain what it knows about the state of the board (and the states of the board it considered in its search and so on). I don't think it would be possible to detect a sabotaged version of the model.
  • You could imagine eliciting knowledge from a human expert. I think that most of the mechanisms would amount to clever incentives for compensating them. Again, I don't think the interesting part is understanding why they are making moves per se, it's just getting them to explain important facts about particular board states that you couldn't have figured out on your own. I think that many possible approaches to ELK won't be applicable to humans, e.g. you can't do regularization based on the structure of the model. Basically all you can do are behavioral incentives + applying time pressure, and that doesn't look like enough to solve the problem.

I think it's also reasonable to talk about ELK in various synthetic settings, or in the case of generative modeling (probably in domains where humans have a weak understanding). Board games seem useful because your AI can so easily be superhuman, but they can have problems because there isn't necessarily that much latent structure.

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2021-12-29T02:59:45.292Z · LW · GW

We need to be able to both tell what decision we want, and identify the relevant inputs on which to train. We could either somehow identify the relevant decisions in advance (e.g. as in adversarial training), or we could do it after the fact by training online if there are no catastrophes, but it seems like we need to get those inputs one way or the other. If there are catastrophes and we can't do adversarial training, then even if we can tell which decision we want in any given case we can still die the first time we encounter an input where the system behaves catastrophically. (Or more realistically during the first wave where our systems are all exposed to such catastrophe-inducing inputs simultaneously.)

Comment by paulfchristiano on Worst-case thinking in AI alignment · 2021-12-28T16:29:22.146Z · LW · GW

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

Even in ML it seems like it depends on how you formulated your problem/goal. Making good predictions in the worst case is impossible, but achieving low regret in the worst case is sensible. (Though still less useful than just "solve existing problems and then try the same thing tomorrow," and generally I'd agree "solve an existing problem for which you can verify success" is the easiest thing to do.) Hopefully having your robot not deliberately murder you is a similarly sensible goal in the worst case though it remains to be seen if it's feasible.
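To make the "low regret in the worst case" point concrete, here is a minimal sketch of the standard Hedge (exponential weights) algorithm. The function names and the alternating loss sequence are my own illustrative choices, not anything from the thread. The learner cannot predict an adversarial sequence well, yet its expected loss stays within roughly sqrt(T ln N / 2) of the best fixed expert, for losses in [0, 1] and a horizon T known in advance.

```python
import math
from typing import Sequence


def hedge_regret(loss_rounds: Sequence[Sequence[float]]) -> float:
    """Run Hedge on per-round expert losses in [0, 1] and return the regret."""
    horizon = len(loss_rounds)
    n_experts = len(loss_rounds[0])
    # Standard learning-rate tuning when the horizon is known in advance.
    eta = math.sqrt(8.0 * math.log(n_experts) / horizon)
    weights = [1.0] * n_experts
    learner_loss = 0.0
    expert_loss = [0.0] * n_experts
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of following an expert sampled from probs.
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        for i, l in enumerate(losses):
            expert_loss[i] += l
            weights[i] *= math.exp(-eta * l)
    return learner_loss - min(expert_loss)


def regret_bound(horizon: int, n_experts: int) -> float:
    """Worst-case regret guarantee for the tuning used above."""
    return math.sqrt(horizon * math.log(n_experts) / 2.0)


# Hypothetical adversarial-looking sequence: the two experts alternate
# between being completely right and completely wrong.
rounds = [[1.0, 0.0] if t % 2 == 0 else [0.0, 1.0] for t in range(1000)]
print(hedge_regret(rounds), "<=", regret_bound(1000, 2))
```

The guarantee is relative to the best expert in hindsight, which is why it can hold for every sequence even though absolute prediction accuracy cannot.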

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2021-12-28T05:31:14.319Z · LW · GW

Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large data guarantees show that much

It seems like you can get malign behavior if you assume:

  1. There are some important decisions on which you can't get feedback.
  2. There are malign agents in the prior who can recognize those decisions.

In that case the malign agents can always defect only on important decisions where you can't get feedback.

I agree that if you can get feedback on all important decisions (and actually have time to recover from a catastrophe after getting the feedback) then malignness of the universal prior isn't important.
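Here is a toy sketch of the two assumptions above, just to show why feedback alone doesn't wash out the malign hypothesis. Everything in it is hypothetical: a two-hypothesis "prior" with equal weights, a parity task, and a `has_feedback` flag standing in for the malign agent's ability to recognize decisions we can't check.

```python
import math
from typing import Callable, Dict

Hypothesis = Callable[[int, bool], int]


def honest(x: int, has_feedback: bool) -> int:
    """Always predicts the true label (here, parity of x)."""
    return x % 2


def malign(x: int, has_feedback: bool) -> int:
    """Matches the truth whenever feedback exists, defects only otherwise."""
    return x % 2 if has_feedback else 1 - (x % 2)


def bayes_update(posterior: Dict[str, float], hyps: Dict[str, Hypothesis],
                 x: int, truth: int, noise: float = 0.01) -> Dict[str, float]:
    """One Bayesian update on a decision where we do observe the truth."""
    unnorm = {}
    for name, weight in posterior.items():
        correct = hyps[name](x, True) == truth
        unnorm[name] = weight * ((1.0 - noise) if correct else noise)
    total = sum(unnorm.values())
    return {name: w / total for name, w in unnorm.items()}


def posterior_after_feedback(n_rounds: int) -> Dict[str, float]:
    hyps = {"honest": honest, "malign": malign}
    posterior = {"honest": 0.5, "malign": 0.5}  # assumed toy prior
    for x in range(n_rounds):
        posterior = bayes_update(posterior, hyps, x, truth=x % 2)
    return posterior


print(posterior_after_feedback(100))        # weights never move
print(honest(7, False), malign(7, False))   # yet they disagree off-feedback
```

Since the two hypotheses agree on every input where we get feedback, no amount of data separates them; the disagreement only shows up on exactly the decisions we can't check.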

I don't have a clear picture of how handling embeddedness or reflection would make this problem go away, though I haven't thought about it carefully. For example, if you replace Solomonoff induction with a reflective oracle it seems like you have an identical problem, does that seem right to you? And similarly it seems like a creature who uses mathematical reasoning to estimate features of the universal prior would be vulnerable to similar pathologies even in a universe that is computable.

ETA: that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.

Comment by paulfchristiano on Reply to Eliezer on Biological Anchors · 2021-12-28T05:23:08.653Z · LW · GW

I'm surprised to see this bullet being bitten. I can easily think of trivial examples against the claim, where we know the minimum complexity of simple things versus their naïve implementations, but I'm not sure what arguments there are for it. It sounds pretty wild to me honestly, I have no intuition algorithmic complexity works anything like that.

I don't know what you mean by an "example against the claim." I certainly agree that there is often other evidence that will improve your bet. Perhaps this is a disagreement about the term "prima facie"? 

Learning that there is a very slow algorithm for a problem is often a very important indicator that a problem is solvable, and savings like  to  seem routine (and often have very strong similarities between the algorithms). And very often the running time of one algorithm is indeed a useful indicator for the running time of a very different approach. It's possible we are thinking about different domains here, I'm mostly thinking of traditional algorithms (like graph problems, approximation algorithms, CSPs, etc.) scaled to input sizes where the computation cost is in this regime. Though it seems like the same is also true for ML (though I have much less data there and moreover all the examples are radically less crisp).

The chance that a  parameter model unlocks AGI given a  parameter model doesn't is much larger than the chance that a  parameter model unlocks AGI given a  parameter model doesn't.

This seems wrong but maybe for reasons unrelated to the matter at hand. (In general an unknown number is much more likely to lie between  and  than between  and , just as an unknown number is more likely to lie between 11 and 16 than between 26 and 31.)

I'm also unclear whether you consider this a general rule of thumb for probabilities in general, or something specific to algorithms. Would you for instance say that if there was a weak proof that we could travel interstellar with Y times better fuel energy density, then there's a 50% chance that there's a method derived from that method for interstellar travel with just  times better energy density?

I think it's a good rule of thumb for estimating numbers in general.  If you know a number is between A and B (and nothing else), where A and B are on the order of , then a log-uniform distribution between A and B is a reasonable prima facie guess.

This holds whether the number is "The best you can do on the task using method X" or "The best you can do on the task using any method we can discover in 100 years" or "The best we could do on this task with a week and some duct tape" or "The mass of a random object in the universe."
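As a quick sketch of what that rule of thumb implies, the snippet below computes interval probabilities and the median under a log-uniform distribution. The bounds in the examples (1e20 to 1e40, and 1 to 100) are numbers I made up for illustration, not values from the discussion.

```python
import math


def log_uniform_prob(lo: float, hi: float, a: float, b: float) -> float:
    """P(lo <= X <= hi) when log(X) is uniform on [log(a), log(b)]."""
    lo, hi = max(lo, a), min(hi, b)
    if hi <= lo:
        return 0.0
    return math.log(hi / lo) / math.log(b / a)


def log_uniform_median(a: float, b: float) -> float:
    """Median of the log-uniform distribution: the geometric mean of a and b."""
    return math.sqrt(a) * math.sqrt(b)  # avoids overflow from a * b


# Hypothetical bounds: a quantity known only to lie between 1e20 and 1e40.
print(log_uniform_median(1e20, 1e40))            # geometric midpoint
print(log_uniform_prob(1e20, 1e30, 1e20, 1e40))  # half the mass per half of the orders

# Equal-width intervals are not equally likely: on a prior over [1, 100],
# the interval 11 to 16 covers more of the log scale than 26 to 31.
print(log_uniform_prob(11, 16, 1, 100), log_uniform_prob(26, 31, 1, 100))
```

The median being the geometric mean is what makes "halfway in orders of magnitude" the natural 50% guess.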

Comment by paulfchristiano on Reply to Eliezer on Biological Anchors · 2021-12-28T05:10:48.375Z · LW · GW

If you think the probability derived from the upper limit set by evolutionary brute force should be spread out uniformly over the next 20 orders of magnitude, then I assume you think that if we bought 4 orders of magnitude today, there is a 20% chance that a method derived from evolutionary brute force will give us AGI? Whereas I would put that probability much lower, since brute force evolution is not nearly powerful enough at those scales.

I don't know what "derived from evolutionary brute force" means (I don't think anyone has said those words anywhere in this thread other than you?)

But in terms of P(AGI), I think that "20% for next 4 orders of magnitude" is a fine prima facie estimate if you bring in this single consideration and nothing else. Of course I don't think anyone would ever do that, but frankly I think "20% for the next 4 orders of magnitude" is still better than most communities' estimates.

Comment by paulfchristiano on Should we rely on the speed prior for safety? · 2021-12-26T23:43:19.414Z · LW · GW

Minimal circuits are not quite the same as fastest programs---they have no adaptive computation, so you can't e.g. memorize a big table but only use part of it. In some sense it's just one more (relatively extreme) way of making a complexity-speed tradeoff. I basically agree that a GLUT is always faster than meta-learning if you have arbitrary adaptive computation.

That said, I don't think it's totally right to call a GLUT constant complexity---if you have an  bit input and  bit output, then it takes at least  operations to compute the GLUT (in any reasonable low-level model of computation). 

There are even more speed-focused methods than minimal circuits. I think the most extreme versions are more like query complexity or communication complexity, which in some sense are just asking how fast you can make your GLUT---can you get away with reading only a small set of input bits? But being totally precise about "fastest" requires being a lot more principled about the model of computation.

Comment by paulfchristiano on Reply to Eliezer on Biological Anchors · 2021-12-26T09:08:35.619Z · LW · GW

I think I endorse the general schema. Namely: if I believe that we can achieve X with  flops but not  flops (holding everything else constant), then I think that gives a prima facie reason to guess a 50% chance that we could achieve it with  flops.

(This isn't fully general, like if you told me that we could achieve something with  flops but not  I'd be more inclined to guess a median of  than .)

Comment by paulfchristiano on My Overview of the AI Alignment Landscape: Threat Models · 2021-12-26T07:08:18.090Z · LW · GW

I'm still pretty confused by "You get what you measure" being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model)

I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.

To a lesser extent, "you get what you measure" may also be an obstacle to using AI systems to help us navigate complex challenges without quick feedback, like improving governance. But I don't think that's an x-risk in itself, more like a missed opportunity to do better. This is in the same category as e.g. failures of the education system, though it's plausibly better-leveraged if you have EA attitudes about AI being extremely important/leveraged. (ETA: I also view AI coordination, and differential capability progress, in a similar way.)

Comment by paulfchristiano on Reply to Eliezer on Biological Anchors · 2021-12-26T07:00:11.041Z · LW · GW

Hypothetically, if the Bio Anchors paper had claimed that strict brute-force evolution would take  ops instead of , what about your argument would actually change? It seems to me that none of it would, to any meaningful degree.

If there are 20 vs 40 orders of magnitude between "here" and "upper limit," then you end up with ~5% vs ~2.5% on the typical order of magnitude. A factor of 2 in probability seems like a large change, though I'm not sure what you mean by "to any meaningful degree."
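The arithmetic here is just a uniform spread over orders of magnitude. A minimal sketch, with function names of my own choosing:

```python
import math


def per_order_probability(orders_to_upper_limit: int) -> float:
    """Probability mass on each order of magnitude under a uniform spread."""
    return 1.0 / orders_to_upper_limit


def probability_within(next_orders: int, orders_to_upper_limit: int) -> float:
    """Mass on the next `next_orders` orders of magnitude."""
    return min(next_orders, orders_to_upper_limit) / orders_to_upper_limit


print(per_order_probability(20))  # 0.05
print(per_order_probability(40))  # 0.025
print(per_order_probability(20) / per_order_probability(40))  # factor of 2
```

Doubling the distance to the upper limit halves the mass on every individual order of magnitude, which is the factor of 2 in question.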

It looks like the plausible ML extrapolations span much of the range from here to .  If we were in a different world where the upper bound was much larger, it would be more plausible for someone to think that the ML-based estimates are too low.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-24T05:58:25.092Z · LW · GW

I didn't follow some parts of the new algorithm. Probably most centrally: what is Dist(S)? Is this the type of distributions over real states of the world, and if so how do we have access to the true map Camera: S --> video? Based on that I likely have some other confusions, e.g. where are the camera_sequences and action_sequences coming from in the definition of Recognizer_M, what is the prior being used to define , and don't Recognizer_M and Recognizer_H effectively advance time a lot under some kind of arbitrary sequences of actions (making them unsuitable for exactly matching up states)?

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-24T05:39:38.758Z · LW · GW

Echoing Mark and Ajeya:

I basically think this distinction is real and we are talking about problem 1 instead of problem 2. That said, I don't feel like it's quite right to frame it as "states" that the human does or doesn't understand. Instead we're thinking about properties of the world as being ambiguous or not in a given state.

As a silly example, you could imagine having two rooms where one room is normal and the other is crazy. Then questions about the first room are easy and questions about the second are hard. But in reality the degrees of freedom will be much more mixed up than that.

To give some more detail on my thoughts on state:

  • Obviously the human never knows the "real" state, which has a totally different type signature than their beliefs.
  • So it's natural to talk about knowing states based on correctly predicting what will happen in the future starting from that state. But it's ~never the case that the human's predictions about what will happen next are nearly as good as the predictor's.
  • We could try to say "you can make good predictions about what happens next for typical actions" or something, but even for typical actions the human predictions are bad relative to the predictor, and it's not clear in what sense they are "good" other than some kind of calibration condition.
  • If we imagine an intuitive translation between two models of reality, most "weird" states aren't outside of the domain of the translation, it's just that there are predictively important parts of the state that are obscured by the translation (effectively turning into noise, perhaps very surprising noise).

Despite all of that, it seems like it really is sometimes unambiguous to say "You know that thing out there in the world that you would usually refer to by saying 'the diamond is sitting there and nothing weird happened to it'? That thing which would lead you to predict that the camera will show a still frame of a diamond? That thing definitely happened, and is why the camera is showing a still frame of a diamond, it's not for some other reason."

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-20T16:23:32.470Z · LW · GW

Suppose the value head learns to predict "Will the human be confidently wrong about the outcome of this experiment," where an 'experiment' is a natural language description of a sequence of actions that the human could execute.  And then the experiment head produces natural language descriptions of actions that a human could take for which they'd be confidently wrong.

What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwards? How do we use this to construct new sensors that allow the human to detect tampering?

ETA: here's my best guess after reading the other comment---after taking a sequence of actions, we run the experiment proposer to suggest an experiment that will allow the human to notice if tampering actually occurred. This seems like it could be different from "experiment that human would be confidently wrong about" since a human who doesn't understand the environment dynamics will always have tons of experiments they are confidently wrong about, but instead we want to find an experiment that causes them to update strongly to believing that tampering occurred. Is that right?

If so it seems like there are a few problems:

  • The proposed experiment could itself perform tampering (after which the human will correctly infer that tampering occurred, thereby giving the experiment a high score), or exploit the human errors to make it appear that tampering had occurred (e.g. if the human is wrong about how sensors work then you can construct new sensors that will appear to report tampering).
  • If you tamper with the mechanism by which the human "executes" the experiment (e.g. by simply killing the human and replacing them with a different experiment-executor) then it seems like the experiment proposer will always lose. This maybe depends on details of exactly how the setup works.
  • Like Mark I do expect forms of tampering that always look fine according to sensors. I agree that beliefs need to cash out in anticipated experience, but it still seems possible to create inputs on which e.g. your camera is totally disconnected from reality.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T22:17:13.264Z · LW · GW

That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to ), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like "zero incentive to tamper, and tampering seems complicated" fail here.

Even for "homeostatic" tasks I expect the difficulty to scale up as the environment becomes more complex (e.g. because you must defend against increasingly sophisticated attackers). There may be some upper bound where you can defend against arbitrarily sophisticated attackers with finite difficulty, but I don't know whether that's true or whether it would be higher or lower than the difficulty of sensor tampering.

While some s may indeed predict this via reasoning, not all s that behave this way would, for example an  that internally modeled the tampering sequence of actions incorrectly as actually leading to  (and didn't even model a distinct ).

I agree that some M's would mistakenly expect a sequence of actions to lead to good outcomes, when they really lead to tampering. But planning against such M's couldn't lead to correct tampering (since the details that determine successful tampering are different from those that lead to good outcomes) and in some sense such M's also can't be expected to defend against tampering. So I don't currently think this is a big problem though I might well be missing something.

 would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that  predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in , or null my-actuators-are-disconnected actions) with higher probability in the future of  than in the future of  (making those states separable in ), or

I generally agree with this---in some sense this kind of "definitely no signals ever" tampering seems pretty unrealistic and it should be much easier to catch more realistic forms of tampering.

If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then  does get identified with , but some other action sequence (of similar length) would lead from  to , a state in which bizarre observations appear forever that would be extremely unlikely at any state in .

Agree with this (modulo new counterexamples), but not yet clear we can exploit this fact to flag tampering as such. (E.g. can we recognize the relevant notion of similarity, or do we just conclude that every state can lead to  and hence we have no idea what is going on?)

One genre of ways this could fail involves using encryption or cryptographic hashes (e.g.  first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable. 

This is exactly the kind of counterexample I would give because it is so clean. That said, I do also believe there are more realistic counterexamples (and I view the focus on this kind of example as mostly a methodological point so that we can do analysis without worrying about empirics).

My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead)

I'm less convinced by this. For example, suppose that my agent first builds a machine that sits between it and its actuators, and then filters out any actions that don't have an appropriate hash. Then I output a sequence of actions that tampers or constructs a successor. Here I am committing not to the hash of my successor, but to the hash of my actions, and so I can't easily circumvent the check.
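A minimal sketch of that kind of action filter, purely to illustrate the commitment structure. The `ActuatorFilter` class, the string encoding of actions, and the choice of SHA-256 are all assumptions of mine; a real version would need an unambiguous serialization of actions.

```python
import hashlib
from typing import List


def commitment(actions: List[str]) -> str:
    """Hash of a whole action sequence (toy encoding: newline-joined strings)."""
    payload = "\n".join(actions).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ActuatorFilter:
    """Hypothetical machine sitting between the agent and its actuators."""

    def __init__(self, committed_hash: str) -> None:
        # Fixed at construction time, before any of the work is done.
        self._committed_hash = committed_hash

    def forward(self, actions: List[str]) -> List[str]:
        """Pass the sequence through only if it matches the commitment."""
        if commitment(actions) == self._committed_hash:
            return list(actions)
        return []  # everything else is filtered out


plan = ["step 1", "step 2", "step 3"]  # placeholder action names
machine = ActuatorFilter(commitment(plan))
print(machine.forward(plan))                      # the committed plan passes
print(machine.forward(["step 1", "other step"]))  # any other plan is blocked
```

Because the commitment is to the actions themselves, changing what gets built later requires a different action sequence, which the filter rejects.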

I agree that regions of  that have nothing to do with delusion/tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to 's ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but

I generally agree that if we weren't worried about the kinds of considerations discussed in the rest of these comments, either because we addressed them or we were working in a methodology that was happy to set them aside as low probability, then it may well be possible to totally patch up these problems (and would be worth thinking about how to do so).

I generally think the family of approaches "This action is similar to something that demonstrably tampers" is very important to consider in practice (it has come up a number of times recently in discussions I've had with folks about more realistic failure stories and what you would actually do to avoid them). It may be more tampering-specific than addressing ELK, but for alignment overall that's fair game if it fixes the problems.

I'm a bit scared that every part of  is "close" to something that is not compatible with any real-world trajectory according to H.

(a) much less conservative than typical impact-measure penalties

Definitely agree with this.

(b) if  can learn what's going on with these regions of  and develop corresponding regions of , then the distance penalty would be replaced by 's informed evaluation of them.

I'm not sure I understand this 100%, but I'm interpreting it as an instance of a more general principle like: we could combine the mechanism we are currently discussing with all of the other possible fixes to ELK and tampering, so that this scheme only needs to handle the residual cases where humans can't understand what's going on at all even with AI assistance (and regularization doesn't work &c). But by that point maybe the counterexamples are rare enough that it's OK to just steer clear of them.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T22:05:12.069Z · LW · GW

This isn't clear to me, because "human imitation" here refers (I think) to "imitation of a human that has learned as much as possible (on the compute budget we have) from AI helpers." So as we pour more compute into the predictor, that also increases (right?) the budget for the AI helpers, which I'd think would make the imitator have to become more complex.

In the following section, you say something similar to what I say above about the "computation time" penalty ... I'm not clear on why this applies to the "computation time" penalty and not the complexity penalty

Yes, I agree that something similar applies to complexity as well as computation time. There are two big reasons I talk more about computation time:

  • It seems plausible we could generate a scalable source of computational difficulty, but it's less clear that there exists a scalable source of description complexity (rather than having some fixed upper bound on the complexity of "the best thing a human can figure out by doing science.")
  • I often imagine the assistants all sharing parameters with the predictor, or at least having a single set of parameters. If you have lots of assistant parameters that aren't shared with the predictor, then it looks like it will generally increase the training time a lot. But without doing that, it seems like there's not necessarily that much complexity the predictor doesn't already know about.
    (In contrast, we can afford to spend a ton of compute for each example at training time since we don't need that many high-quality reporter datapoints to rule out the bad reporters. So we can really have giant ratios between our compute and the compute of the model.)

But I don't think these are differences in kind and I don't have super strong views on this.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T21:59:55.525Z · LW · GW

For example, the "How we'd approach ELK in practice" section talks about combining several of the regularizers proposed by the "builder." It also seems like you believe that combining multiple regularizers would create a "stacking" benefit, driving the odds of success ever higher.

This is because of the remark on ensembling---as long as we aren't optimizing for scariness (or diversity for diversity's sake), it seems like it's way better to have tons of predictors and then see if any of them report tampering. So adding more techniques improves our chances of getting a win. And if the cost of fine-tuning a reporter is small relative to the cost of training the predictor, we can potentially build a very large ensemble relatively cheaply.

(Of course, having more techniques also helps because you can test many of them in practice and see which of them seem to really help.)
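A toy sketch of the ensembling logic, with made-up reporters and cost numbers: flag a situation if any reporter in the ensemble reports tampering, and note that total cost grows slowly when reporters are cheap relative to the predictor.

```python
from typing import Callable, Dict, Iterable

State = Dict[str, bool]
Reporter = Callable[[State], bool]  # returns True if it reports tampering


def ensemble_reports_tampering(reporters: Iterable[Reporter], state: State) -> bool:
    """Flag the state if at least one reporter reports tampering."""
    return any(reporter(state) for reporter in reporters)


def ensemble_cost(predictor_cost: float, reporter_cost: float, n_reporters: int) -> float:
    """Total cost of one shared predictor plus n fine-tuned reporters."""
    return predictor_cost + n_reporters * reporter_cost


# Hypothetical reporters: one only looks at what the camera shows (and so
# misses tampering), the other reads the latent fact directly.
def camera_based(state: State) -> bool:
    return not state["camera_shows_diamond"]


def latent_based(state: State) -> bool:
    return not state["diamond_present"]


tampered = {"camera_shows_diamond": True, "diamond_present": False}
print(ensemble_reports_tampering([camera_based], tampered))                # missed
print(ensemble_reports_tampering([camera_based, latent_based], tampered))  # caught
print(ensemble_cost(predictor_cost=1000.0, reporter_cost=1.0, n_reporters=50))
```

The "any" rule only helps if bad reporters fail by staying quiet rather than by raising false alarms, which is the sense in which we shouldn't be optimizing for scariness.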

This is also true for data---I'd be scared about generating a lot of riskier data, except that we can just do both and see if either of them reports tampering in a given case (since they appear to fail for different reasons).

It also seems like you believe that combining multiple regularizers would create a "stacking" benefit, driving the odds of success ever higher.

I believe this in a few cases (especially combining "compress the predictor," imitative generalization, penalizing upstream dependence, and the kitchen sink of consistency checks) but mostly the stacking is good because ensembling means that having more and more options is better and better.

Right now, the writeup talks about possible worlds in which a given regularizer could be helpful, and possible worlds in which it could be unhelpful. I'd value more discussion of the intuition for whether each one is likely to be helpful, and in particular, whether it's likely to be helpful in worlds where the previous ones are turning out unhelpful.

I don't think the kind of methodology used in this report (or by ARC more generally) is very well-equipped to answer most of these questions. Once we give up on the worst case, I'm more inclined to do much messier and more empirically grounded reasoning. I do think we can learn some stuff in advance, but doing so requires getting really serious about it (while still learning from early experiments and focusing mostly on designing experiments) rather than taking potshots. This is related to a lot of my skepticism about other theoretical work.

I do expect the kind of research we are doing now to help with ELK in practice even if the worst case problem is impossible. But the particular steps we are taking now are mostly going to help by suggesting possible algorithms and difficulties; we'd then want to give those as one input into that much messier process in order to think about what's really going to happen.

In this case, it seems like penalizing complexity, computation time, and 'downstream variables' (via rewarding reporters for requesting access to limited activations) probably make things worse. (I think this applies less to the last two regularizers listed.)

I think this is plausible for complexity and to a lesser extent for computation time. I don't think it's very plausible for the most exciting regularizers, e.g. a good version of penalizing dependence on upstream nodes or the versions of computation time that scale best (and are really trying to incentivize the model to "reuse" inference that was done in the AI model). I think I do basically believe the arguments given in those cases, e.g. I can't easily see how translation into the human ontology can be more downstream than "use the stuff to generate observations then parse those observations."

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T18:54:43.399Z · LW · GW

Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI's current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn't need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI's perspective) features that make it hard for the AI to predict correctly.

I'm imagining delegating to humans who are very similar to (and ideally indistinguishable from) the humans who will actually exist in the world that we bring about. I'm scared about very alien humans for a bunch of reasons---hard for the AI to reason about, may behave strangely, and makes it harder to use "corrigible" strategies to easily satisfy their preferences. (Though that said, note that the AI is reasoning very abstractly about such future humans and cannot e.g. predict any of their statements in detail.)

How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit's mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H's situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they're not living in the real world.

Ideally we are basically asking each human what they want their future to look like, not asking them to evaluate a very different world.

Ideally we would literally only be asking the humans to evaluate their future. This is kind of like giving instructions to their AI about what it should do next, but a little bit more indirect since they are instead evaluating futures that their AI could bring about.

The reason this doesn't work is that by the time we get to those future humans, the AI may already be in an irreversibly bad position (e.g. because it hasn't acquired much flexible influence that it can use to help the humans achieve their goals). This happens most obviously at the very end, but it also happens along the way if the AI failed to get into a position where it could effectively defend us. (And of course it happens along the way if people are gradually refining their understanding of what they want to happen in the external world, rather than having a full clean separation into "expand while protecting deliberation" + "execute payload.")

However, when this happens it is only because the humans along the way couldn't tell that things were going badly---they couldn't understand that their AI had failed to gather resources for them until they actually got to the end, asked their AI to achieve something, and were unhappy because it couldn't. If they had understood along the way, then they would never have gone down this route.

So at the point when the humans are thinking about this question, you may hope that they are actually ignorant about whether their AI has put them in a good situation. They are providing their views about what they want to happen in the world, hoping that their AI can achieve those outcomes in the world. The AI will only "back up" and explore a different possible future instead if it turns out that it isn't able to get the humans what they want as effectively as it would have been in some other world. But in this case the humans don't even know that this backing up is about to occur.  They never evaluate the full quality of their situation, they just say "In this world, the AI fails to do what they want" (and it becomes clear the situation is bad when in every world the AI fails to do what they want).

I don't really think the strong form of this can work out, since the humans may e.g. become wiser and realize that something in their past was bad. And if they are just thinking about their own lives they may not want to report that fact since it will clearly cause them not to exist. I think it's not really clear how to handle that.

(If the problem they notice was a fact about their early deliberation that they now regret then I think this is basically a problem for any approach. If they notice a fact about the AI's early behavior that they don't like, but they are too selfish to want to "unwind" it and therefore claim to be happy with what their AI does for them, then that seems like a more distinctive problem for this approach. More generally, there is a risk that people will be looking for any signs that a possible future is "their" future and preferring it, and that this effectively removes the ability to unwind and therefore eliminates the AI's incentive to acquire resources, and that we couldn't reintroduce it without giving up on the decoupling that lets us avoid incentives for manipulation.)

(I do think that issues like this are even more severe for many other approaches people are imagining to defining values, e.g. in any version of decoupled RL you could have a problem where overseers rate their own world much better than alternatives. You could imagine approaches that avoid this by avoiding indirect normativity, but currently it looks to me like they avoid problems only by being very vague about what "values" means.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T16:23:50.242Z · LW · GW

In general we don't have an explicit representation of the human's beliefs as a Bayes net (and none of our algorithms are specialized to this case), so the only way we are representing "change to Bayes net" is as "information you can give to a human that would lead them to change their predictions."

That said, we also haven't described any inference algorithm other than "ask the human." In general inference is intractable (even in very simple models), and the only handle we have on doing fast+acceptable approximate inference is that the human can apparently do it.

(Though if that was the only problem then we also expect we could find some loss function that incentivizes the AI to do inference in the human Bayes net.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T16:20:42.269Z · LW · GW

It will depend on how much high-quality data you need to train the reporter. Probably it's a small fraction of the data you need to train the predictor, and so for generating each reporter datapoint you can afford to use many times more data than the predictor usually uses. I often imagine the helpers having 10-100x more computation time.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-18T23:02:48.428Z · LW · GW

I'd be scared that the "Am I tricking you?" head just works by:

  1. Predicting what the human will predict
  2. Predicting what will actually happen
  3. Outputting a high value iff the human's prediction is confident but different from reality.

If this is the case, then the head will report detectable tampering but not undetectable tampering.
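The failure mode can be sketched in a few lines. This is a toy under a stated assumption: the "reality" the head compares against is modeled as what ends up being observed, which is what makes undetectable tampering slip through. The names `Episode` and `bad_tricking_head`, and the 0.9 threshold, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    human_predicted_obs: str   # step 1: what the human will predict
    human_confidence: float    # how confident the human is in that prediction
    actual_obs: str            # step 2: what will actually be observed (assumption)


def bad_tricking_head(ep: Episode, threshold: float = 0.9) -> bool:
    """Step 3: fire iff the human is confident but their prediction differs."""
    return ep.human_confidence >= threshold and ep.human_predicted_obs != ep.actual_obs


# Detectable tampering: the observations visibly diverge from the human's prediction.
detectable = Episode("diamond on pedestal", 0.95, "camera feed visibly glitches")
# Undetectable tampering: the sensors show exactly what the human expects.
undetectable = Episode("diamond on pedestal", 0.95, "diamond on pedestal")

print(bad_tricking_head(detectable), bad_tricking_head(undetectable))
```

In this toy the head fires on the first episode and stays silent on the second, even though tampering occurred in both.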

To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren't, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I think there's a big genre of proposals that try to leverage that kind of structure, which might be promising (though it's not the kind of thing I'm thinking about right now).

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-18T16:08:47.128Z · LW · GW

Is the idea that the helper AI allows the labeler to understand everything just as well as SmartVault does, so that there's no difference in their respective Bayes nets, and so it works for SmartVault to use the labeler's Bayes net?

Yes, that's the main way this could work. The question is whether an AI understands things that humans can't understand by doing amplification/debate/rrm. Our guess is yes, and the argument is mostly "until the builder explains why, gradient descent and science may just have pretty different strengths and weaknesses" (and we can make that more concrete by fleshing out what the world may be like and what the AI learns by gradient descent). But it seemed worth raising because this does appear to make the bad reporter's job much harder and greatly restrict the space of cases where it fails to report tampering.

Methodologically, the way I think about this kind of thing is: (i) we had a counterexample, (ii) after making this change that particular counterexample no longer works, (iii) now we want to think through whether the counterexample can be adapted.

This is also legitimately less obvious. An AI can't simulate (human+AI helpers), since each AI helper is as smart as the AI itself and so simulating (human+AI helpers) clearly requires more compute than the AI has. The counterexample is that the AI should just try its best to do inference in the Bayes net that includes "everything the human could understand with the amount of science they have time to do."

But that does still leave the builder with avenues to try to strengthen the algorithm and win. One way is discussed in the section on speed regularization: if the AI is "trying its best" to do inference in the human Bayes net then there might always be returns to having more time to think (and so it might be able to benefit by transferring over its understanding of what was happening in the AI Bayes net rather than recomputing from the observations). The next step for a builder who wanted to take this approach would be to argue that they can reliably construct a complex enough dataset that this advantage is relevant.

My guess is that this doesn't work on its own, but if you could scalably construct more complex data then it might work when combined with imitative generalization, as discussed here.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-17T16:58:47.735Z · LW · GW

I'm thinking of this in a family of proposals like:

  • Some kinds of tampering can be easily detected (and so should get identified with states  where tampering has occurred)
  • Some other tampering can't be easily detected, but this undetectable tampering has important similarities with detectable tampering and we could use that to recognize it.
  • In this case, we're going to try to exploit the fact that detectable tampering shares a prefix of actions/states with undetectable tampering (such that later states reached in that sequence have a much higher probability of detectable tampering, as well as undetectable tampering, and can be flagged as suspicious).
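To make the prefix idea concrete, here is a toy sketch. It assumes we have a set of action sequences labeled only with whether tampering was detected, and it flags any prefix from which a large fraction of the recorded continuations ended in detected tampering. The function names, the example actions, and the 0.5 threshold are all hypothetical, and this is not a claim about how the proposal under discussion actually defines its loss.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Trajectory = Tuple[str, ...]


def prefix_tamper_rates(
    trajectories: Sequence[Tuple[Trajectory, bool]]
) -> Dict[Trajectory, float]:
    """For every prefix, the fraction of trajectories through it with detected tampering."""
    counts = defaultdict(lambda: [0, 0])  # prefix -> [detected, total]
    for actions, detected in trajectories:
        for k in range(1, len(actions) + 1):
            prefix = actions[:k]
            counts[prefix][1] += 1
            if detected:
                counts[prefix][0] += 1
    return {prefix: d / t for prefix, (d, t) in counts.items()}


def is_suspicious(prefix: Sequence[str], rates: Dict[Trajectory, float],
                  threshold: float = 0.5) -> bool:
    """Flag a prefix whose continuations often end in detected tampering."""
    return rates.get(tuple(prefix), 0.0) >= threshold


data = [
    (("open_panel", "rewire_camera", "clumsy_patch"), True),
    (("open_panel", "rewire_camera", "clumsy_patch"), True),
    # Undetectable tampering: shares a prefix with the above but is never labeled.
    (("open_panel", "rewire_camera", "subtle_patch"), False),
    (("lock_door", "check_vault"), False),
]
rates = prefix_tamper_rates(data)
print(is_suspicious(("open_panel", "rewire_camera"), rates))
```

The subtle trajectory is never labeled as tampering, but the state reached after its first two actions is flagged because detectable tampering usually follows from there.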

I currently don't believe this kind of thing can work (at least in the worst case), but I'm not at all confident.

One thing worth mentioning (which makes me much more optimistic about this class of proposals): if your AI delegates to another AI who helps actually implement the tampering, that makes things harder but in some sense it also provides extra structure you can work with in a solution. Your AI needs to reason about that other AI's reasoning, and so you could hope (i) if your AI is reasoning about the other AI's reasoning in detail, then we can apply exactly the same machinery to that AI's predictive model in order to elicit its beliefs (which your AI obviously trusts) about what is going on, (ii) if your AI reasons abstractly about the other AI's reasoning, you could hope to treat that as a shortcut to reach the same conclusions and incentivize the model to give you the same answers that it would if it were reasoning explicitly (modulo its ability to figure it out).

(I also think the single-AI case basically needs to be treated in the same way as that---I don't think in general your AI is going to be reasoning explicitly about its own future actions, it just needs to reason abstractly about its future self in a qualitatively similar way to how it reasons about other AIs. Or at least in the worst case its reasoning about itself is just as messy and complex as its reasoning about other AIs.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-17T16:13:39.920Z · LW · GW

The sense in which the model knows about the corruption is that it brought it about and reasoned about the nature of the sensor tampering in order to predict the transition to .

The reason I'm concerned that it brings about this state is because the actual good state  is much harder to access than  (e.g. because it requires achieving hard real-world goals). The intuition is that  has constant difficulty while  gets harder and harder as we make the tasks more sophisticated.

I'm OK having the human look at entire trajectories, so it's provisionally OK (though not the most satisfying) if  gets identified with  as long as the preceding states look bad to the human and then they can do some kind of inference on top to figure out what happened. My main question is why those preceding states get mapped correctly.

If the preceding state  gets corrupted via a mechanism that isn't obvious to the human, and otherwise looks like a predecessor to , then it seems like it should get identified with a predecessor of 

The only way I see this working is if the AI can "abort" its plan at the last minute and demonstrate a hack.  But my main concerns are that (i) that doesn't seem true in general---the AI can tamper with sensors in ways humans don't understand starting from many initial states, and you can't distinguish the "almost-tampered" state naturally since it's just one more state that has this property, (ii) even if that happens it doesn't look to me like the proposed loss actually captures that if there are other similarly-severe predictive errors in the human's model, though if I wasn't worried about (i) I'd want to think about (ii) much more carefully and actually construct a counterexample.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-17T05:14:02.207Z · LW · GW

Consider a state  where the sensors have been tampered with in order to "look like" the human state , i.e. we've connected the actuators and camera to a box which just simulates the human model (starting from ) and then feeds the predicted outputs of the human model to the camera.

It seems to me like the state  would have zero distance from the state  under all of these proposals. Does that seem right? (I didn't follow all of the details of the example, and definitely not the more general idea.)

(I first encountered this counterexample in Alignment for advanced machine learning systems. They express the hope that you can get around this by thinking about the states that can lead to the sensor-tampered state and making some kind of continuity assumption, but I don't currently think you can make that work and it doesn't look like your solution is trying to capture that intuition.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-17T05:06:29.281Z · LW · GW

The previous definition was aiming to define a utility function "precisely," in the sense of giving some code which would produce the utility value if you ran it for a (very, very) long time.

One basic concern with this is (as you pointed out at the time) that it's not clear that an AI which was able to acquire power would actually be able to reason about this abstract definition of utility. A more minor concern is that it involves considering the decisions of hypothetical humans very unlike those existing in the real world (who therefore might reach bad conclusions or at least conclusions different from ours).

In the new formulation, the goal is to define the utility in terms of the answers to questions about the future that seem like they should be easy for the AI to answer because they are a combination of (i) easy predictions about humans that it is good at, (ii) predictions about the future that any power-seeking AI should be able to answer.

Relatedly, this version only requires making predictions about humans who are living in the real world and being defended by their AI. (Though those humans can choose to delegate to some digital process making predictions about hypothetical humans, if they so desire.) Ideally I'd even like all of the humans involved in the process to be indistinguishable from the "real" humans, so that no human ever looks at their situation and thinks "I guess I'm one of the humans responsible for figuring out the utility function, since this isn't the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically."

More structurally, the goal is to define the utility function in terms of the kinds of question-answers that realistic approaches to ELK could elicit, which doesn't seem to include facts about mathematics that are much too complex for humans to derive directly and where they need to rely on correlations between mathematics and the physical world---in those cases we are essentially just delegating all the reasoning about how to couple them (e.g. how to infer that hypothetical humans will behave like real humans) to some amplified humans, and then we might as well go one level further and actually talk about how those humans reason.

The point of doing this exercise now is mostly to clarify what kind of answers we need to get out of ELK, and especially to better understand whether it's worth exploring "narrow" approaches (methodologically it may make sense anyway because they may be a stepping stone to more ambitious approaches, but it would be more satisfying if they could be used directly as a building block in an alignment scheme). We looked into it enough to feel more confident about exploring narrow approaches.