Posts

Imitative Generalisation (AKA 'Learning the Prior') 2021-01-10T00:30:35.976Z
Debate update: Obfuscated arguments problem 2020-12-23T03:24:38.191Z
Looking for adversarial collaborators to test our Debate protocol 2020-08-19T03:15:26.732Z
Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z

Comments

Comment by beth-barnes on Debate update: Obfuscated arguments problem · 2021-01-13T01:35:47.827Z · LW · GW
In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing whether they can find the flaw?

Not systematically; I would be excited about people doing these experiments. One tricky thing is that you might think this is a strategy that's possible for ML models, but one that humans aren't naturally very good at.

If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a flawed set of arguments for a false claim such that an equally competent expert can't identify the flawed component, and the set of arguments doesn't immediately look suspect". This seems surprising, and I'm wondering whether it's unique to physics. (The cryptographic example was of this kind, but there, the structure of the dishonest arguments was suspect.)

Yeah, this is a great summary. One thing I would clarify is that it's sufficient that the set of arguments doesn't look suspicious to the judge. The arguments might look suspicious to the expert, but unless they have a way to explain to the judge why they're suspicious, we still have a problem.

If this finding holds, my immediate reaction is "okay, in this case, the solution for the honest debater is to start a debate about whether the set of arguments from the dishonest debater has this character". I'm not sure how good this sounds. I think my main issue here is that I don't know enough physics to understand why the dishonest arguments are hard to identify.

Yeah, I think that is the obvious next step. The concern is that the reasons the argument is suspicious may be hard to justify in a debate, especially if they're reasons of the form 'look, I've done a bunch of physics problems, and approaching it this way feels like it will make things messy, whereas approaching it that way feels cleaner'. Debate probably doesn't work very well for supervising knowledge that's gained through finding patterns in data, as opposed to knowledge that's gained through step-by-step reasoning. Something like imitative generalisation (AKA 'learning the prior') is trying to fill this gap.

Comment by beth-barnes on Debate update: Obfuscated arguments problem · 2021-01-13T01:25:09.389Z · LW · GW

When you say 'this approach', what are you referring to?

Comment by beth-barnes on Imitative Generalisation (AKA 'Learning the Prior') · 2021-01-11T20:42:37.818Z · LW · GW
It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements - this is the main reason that we need science.

Agree that humans are not necessarily great at assigning priors. The main response to this is that we don't have a way to get better priors than an amplified human's best prior. If amplified humans think the NN prior is better than their own, they can always just use that instead. So in theory this should be both strictly better than the alternative and the best possible prior we can use.

Science seems like it's about collecting more data and measuring the likelihood, not changing the prior. We still need to use our prior - there are infinitely many scientific theories that fit the data, but we prefer ones that are simple and elegant.

z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can't be calculated recursively, because there may be arbitrarily complicated interactions between different components of z.

One thing that helps a bit here is that we can use an amplified human. We also don't need the human to calculate the prior directly, just to do things like assess whether some change makes the prior better or worse. But I'm not sure how much of a roadblock this is in practice, or what Paul thinks about this problem.

Consider the following proposal: "train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy". Is there an important difference between this and imitative generalisation?

Yeah, the important difference is that in this case there's nothing that constrains the explanations to be the same as the actual reasoning the oracle is using, so the explanations you're getting are not necessarily predictive of the kind of generalisation that will happen. In IG it's important that the quality of z is measured by having humans use it to make predictions.
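
To make that concrete, here's a rough sketch of the objective as I understand it from the Learning the Prior / IG posts (notation is mine, not from the posts): z is chosen to score well under the amplified human's prior *and* to let humans predict the labelled data we can check, and only then is it used for new predictions.

```latex
% Rough sketch of the imitative generalisation objective (my notation, not from the post).
% z = background knowledge / instructions, D = labelled data we can check,
% H = an (amplified) human making judgements with access to z.
z^* = \arg\max_z \; \Big[ \log P_H(z) \;+\; \sum_{(x,y)\in D} \log P_H(y \mid x, z) \Big]
% Predictions on new inputs x' then come from P_H(\,\cdot \mid x', z^*),
% in practice approximated by an ML model trained to imitate the human-using-z.
```

The point of the contrast with the oracle proposal is that here z only gets credit through the human predictions it supports, so it can't be decoupled from the reasoning that produces the answers.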

An agent can "generalise badly" because it's not very robust, or because it's actively pursuing goals that are misaligned with those of humans. It doesn't seem like this proposal distinguishes between these types of failures. Is this distinction important in motivating the proposal?

I'm not sure exactly what you're asking. I think the proposal is motivated by something like: having the task be IID/being able to check arbitrary outputs from our model to make sure it's generalising correctly buys us a lot of safety properties. If we have this guarantee, we only have to worry about rare or probabilistic defection, not that the model might be giving us misleading answers for every question we can't check.

Comment by beth-barnes on Debate Minus Factored Cognition · 2021-01-06T07:07:34.484Z · LW · GW

Thanks for the post, I'm excited that you're thinking about debate!

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing). 
Basically, it sounds like you're saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we're providing, then we're going to need to run an extremely large number of debates to ever get a good answer (i.e. an exponential number of debates for a question whose explanation is exponential-sized).

It sounds like you're saying that we can avoid requiring the judge to assume one player is honest (i.e. to trust the claims lower in the debate tree when evaluating the claims higher in the tree). But if we can't assume this, that presumably means that some reasonable fraction of all claims being made are dishonest (because if there were only a few dishonest claims, then they'd have honest defeaters and we'd have a clear training signal away from dishonesty, so after training for a bit we'd be able to trust the lower claims). This probably means that most debates will give us a bad answer (as you only need a few bad claims to invalidate the whole tree). At this point, debate isn't really competitive, because it gives us dud answers almost all the time, and we're going to have to run an exponential number of debates before we happen on a correct one.
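
To spell out the scaling worry with a toy calculation (the numbers and independence assumption are illustrative, not from the post): if a debate involves k claims and each is dishonest with some fixed probability, clean debates become exponentially rare.

```latex
% Toy model: each of the k claims in a debate is dishonest independently with probability \epsilon.
P(\text{whole debate is honest}) = (1-\epsilon)^k
\qquad\Longrightarrow\qquad
\mathbb{E}[\text{debates needed to hit an honest one}] = (1-\epsilon)^{-k},
% which grows exponentially in k for any fixed \epsilon > 0.
```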

Are you suggesting we use debate more as a check on our AI systems, to help us discover that they're bad, rather than as a safe alternative? I.e. debate never produces good answers; it just lets you see that bad answers are bad?

But also, the 'amplified judge consulting sub-debates' sounds like it's just the same thing as letting the judge assume that claims lower in the debate are correct when evaluating claims higher in the tree. 

Comment by beth-barnes on Debate Minus Factored Cognition · 2021-01-06T06:53:07.313Z · LW · GW

The standard argument against having a non-zero-sum debate game is that then you may incentivise your debaters to collude.  

I don't know if you've seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior - seems somewhat relevant to what you're thinking about here. 

Comment by beth-barnes on Debate update: Obfuscated arguments problem · 2021-01-06T06:39:50.109Z · LW · GW

To be clear, I think this is a good suggestion and is close to how I imagine we'd actually run debate in practice. It just doesn't get us beyond MA if the debaters only write P-size arguments.

Comment by beth-barnes on Debate update: Obfuscated arguments problem · 2021-01-06T06:36:31.062Z · LW · GW

I'd be interested to hear more detail of your thoughts on how we might use robustness techniques!

Comment by beth-barnes on Debate update: Obfuscated arguments problem · 2020-12-27T23:31:10.487Z · LW · GW

Yep, planning to put up a post about that soon. The short argument is something like:
The equivalent of an obfuscated argument for IDA is a decomposition that includes questions the model doesn't know how to answer.
We can't always tell the difference between an IDA tree that uses an obfuscated decomposition and gets the wrong answer, vs. an IDA tree that uses a good decomposition and gets the right answer, without unpacking the entire tree.

Comment by beth-barnes on Debate update: Obfuscated arguments problem · 2020-12-24T03:33:34.630Z · LW · GW

I just mean that this method takes time of order the length of the argument in judge-understandable language. So if the argument is large, then you're going to need to let the debate run for a long time. This is as opposed to the previous hope that even if the argument tree is exponential-sized, the debate can run in polynomial time.

Comment by beth-barnes on Debate update: Obfuscated arguments problem · 2020-12-23T23:30:07.948Z · LW · GW

Thanks!

Yep, this does work, but it limits us to questions where the argument in judge-understandable language is short enough that the debaters can write the whole thing down. So if the debaters run in polynomial time at deployment, this gives us MA, not PSPACE as originally hoped.

Comment by beth-barnes on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-19T00:33:21.454Z · LW · GW

One counterexample is the Manhattan Project - they developed two different designs simultaneously because they weren't sure which would work better. From Wikipedia: "Two types of atomic bombs were developed concurrently during the war: a relatively simple gun-type fission weapon and a more complex implosion-type nuclear weapon."
https://en.wikipedia.org/wiki/Manhattan_Project

Comment by beth-barnes on AI safety via market making · 2020-11-22T06:04:54.031Z · LW · GW

Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that's supported by a good argument is disputed, the honest debater will pay to recurse and will give their good argument.

Comment by beth-barnes on Learning Normativity: A Research Agenda · 2020-11-18T19:31:29.306Z · LW · GW

I see myself as trying to construct a theory of normativity which gets that "by construction", i.e., we can't expect to find any mechanism which does better, because if we could say anything about what that mechanism does better, then we could tell it to the system, and the system would take it into account.

Nice - this is what I was trying to say but was struggling to phrase. I like this.

I guess I usually think of HCH as having this property, as long as the thinking time for each human is long enough, the tree is deep enough, and we're correct about the hope that natural language is sufficiently universal. It's quite likely I'm either confused or being sloppy though.

You could put 'learning the prior' inside HCH, I think; it would just be inefficient. For every claim, you'd ask your HCH tree how much you should believe it, and HCH would think about the correct way to do Bayesian reasoning, what the prior on that claim should be, and how well it predicted every piece of data you'd seen so far, in conjunction with everything else in your prior. I think one view of learning the prior is just making this process more tractable/practical, and saving you from having to revisit all your data points every time you ask any question - you just do all the learning from data once, then use the result of that to answer any subsequent questions.
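
Here's a caricature of that contrast in pseudocode; the hch() interface and everything else below is invented purely to illustrate the amortisation point, not an actual implementation:

```python
# Hypothetical interface: hch(question) returns an (amplified) human judgement.
# Everything below is an illustrative caricature, not an actual implementation.

def believe_via_hch(claim, data, hch):
    """Inefficient version: re-derive the posterior for every single claim."""
    belief = hch(f"What prior should we assign to: {claim}?")
    for datapoint in data:                      # revisit *all* the data per claim
        belief = hch(f"Update belief {belief} in '{claim}' given {datapoint}")
    return belief

def learn_background_once(data, hch):
    """'Learning the prior': do the learning from data a single time."""
    return hch(f"Summarise everything worth believing given {data}")  # this plays the role of z

def believe_via_z(claim, z, hch):
    """Then answer any later claim using z, without touching the data again."""
    return hch(f"Given background knowledge {z}, how much should we believe: {claim}?")
```

The first version revisits every datapoint for every claim; the second does the learning from data once and reuses z for all subsequent questions.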

Comment by beth-barnes on Learning Normativity: A Research Agenda · 2020-11-18T08:08:30.152Z · LW · GW

However, that only works if we have the right prior. We could try to learn the prior from humans, which gets us 99% of the way there... but as I've mentioned earlier, human imitation does not get us all the way. Humans don't perfectly endorse their own reactions.

Note that Learning the Prior uses an amplified human (i.e. a human with access to a model trained via IDA/Debate/RRM). So we can do a bit better than a base human - e.g. we could do something like having an HCH tree where many humans generate possible feedback and other humans look at the feedback and decide how much they endorse it.
I think the target is not to get normativity 'correct', but to design a mechanism such that we can't expect to find any mechanism that does better.

Comment by beth-barnes on Extortion beats brinksmanship, but the audience matters · 2020-11-18T06:22:30.321Z · LW · GW

FYI/nit: at first glance I thought extorsion was supposed to mean something different from extortion (I've never seen it spelt with the s) and this was a little confusing. 

Comment by beth-barnes on AI safety via market making · 2020-11-18T06:18:26.697Z · LW · GW

Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it. 
If the dishonest debater disputes an honest claim where the honest debater has an argument for their answer that actually bottoms out, the dishonest debater will lose - the honest debater will pay to recurse until they get to a winning node. 
If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn't pay to recurse, the judge will just see these two alternative answers and won't trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose.
Does that make sense?


Comment by beth-barnes on AI safety via market making · 2020-11-16T21:32:19.053Z · LW · GW

Suppose by strong induction that M always gives the right answer immediately for all sets of size less than n.

Pretty sure debate can also access R if you make an assumption this strong - i.e. assume that debaters give correct answers for all questions that can be answered with a debate tree of size < n.

I think the sort of claim that's actually useful is going to look more like 'we can guarantee that we'll get a reasonable training signal for problems in [some class]'

I.e., suppose M gives correct answers some fraction of the time. Are these answers going to get lower loss? As n gets large, the chance that M has made a mistake somewhere in the recursion chain gets large, and the correct answer is not necessarily rewarded.

Comment by beth-barnes on AI safety via market making · 2020-11-16T21:20:08.853Z · LW · GW

I think for debate you can fix the circular argument problem by requiring debaters to 'pay' (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don't resolve before the depth limit, then they'll lose.  
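
Here's a toy sketch of the scoring rule I have in mind; the constants, function names, and structure are all invented for illustration and aren't the actual protocol:

```python
# Toy model of 'paying to recurse' (illustrative only; not the actual debate protocol).

RECURSION_COST = 1.0   # score sacrificed each time a debater asks to recurse
WIN_REWARD = 10.0      # reward if a disputed claim bottoms out in the payer's favour

def run_debate(claim, depth_limit, resolves_for, subclaim_of, payer_score=0.0, depth=0):
    """Recurse on `claim` until it resolves or we hit the depth limit.

    resolves_for(claim) -> 'payer' | 'opponent' | None (None = needs more recursion)
    subclaim_of(claim)  -> the next claim the paying debater wants to recurse on
    """
    if depth >= depth_limit:
        # Claim never bottomed out: the paying debater is left with whatever they've sacrificed.
        return payer_score
    winner = resolves_for(claim)
    if winner == 'payer':
        return payer_score + WIN_REWARD
    if winner == 'opponent':
        return payer_score - WIN_REWARD
    # Unresolved: pay to go one level deeper.
    return run_debate(subclaim_of(claim), depth_limit, resolves_for, subclaim_of,
                      payer_score - RECURSION_COST, depth + 1)
```

A circular argument never resolves before the depth limit, so a debater who keeps paying to recurse on it bleeds RECURSION_COST per level and ends up behind.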

Comment by beth-barnes on AGI safety from first principles: Goals and Agency · 2020-10-17T00:30:43.211Z · LW · GW

But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse.

I think the ways in which humans are not fully consequentialist are much broader - we often do things out of habit or instinct, because doing the thing feels rewarding in itself, because we're imitating someone else, etc.

Comment by beth-barnes on Looking for adversarial collaborators to test our Debate protocol · 2020-08-19T18:24:08.185Z · LW · GW

Yep, or in comments. Thanks!

Comment by beth-barnes on Writeup: Progress on AI Safety via Debate · 2020-08-19T03:32:52.050Z · LW · GW

That's correct about simultaneity.

Yeah, the questions and answers can be arbitrary; they don't have to be X and ¬X.

I'm not completely sure whether Scott's method would work given how we're defining the meaning of questions, especially in the middle of the debate. The idea is to define the question by how a snapshot of the questioner, taken when they wrote the question, would answer questions about what they meant. So in this case, if you asked the questioner "is your question equivalent to 'should I eat potatoes tonight?'", they wouldn't know. On the other hand, you could ask them "if I think you should eat potatoes tonight, is your question equivalent to 'should I eat potatoes tonight?'". This would work as long as you were referring only to what one debater believed you should eat tonight, I think.

I feel fairly ok about this as a way to define the meaning of questions written by debaters within the debate. I'm less sure about how to define the top-level question. It seems like there's only really one question, which is 'what should I do?', and it's going to have to be defined by how the human asker clarifies their meaning. I'm not sure whether the meaning of the question should be allowed to include things the questioner doesn't know at the time of asking.

Comment by beth-barnes on Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns · 2020-07-23T03:34:05.605Z · LW · GW

Yeah, I also thought this might just be true already, for similar reasons.

Comment by beth-barnes on $1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is · 2020-07-21T20:22:45.926Z · LW · GW

Of course GPT-3 isn't aligned: its objective is to output the most likely next word, i.e. to imitate text on the internet. It seems pretty certain that if you give it a prompt that tells it it should be imitating some part of the internet where someone says something dumb, it will say something dumb, and if you give it a prompt that tells it it's imitating something where someone says something smart, it will "try" to say something smart. This question seems weird to me - am I missing something?

Comment by beth-barnes on Tessellating Hills: a toy model for demons in imperfect search · 2020-03-12T06:03:07.969Z · LW · GW

I have the same confusion.

Comment by beth-barnes on Using vector fields to visualise preferences and make them consistent · 2020-03-05T06:48:29.335Z · LW · GW

You might find this paper interesting. It does a similar decomposition with the dynamics of differentiable games (where the 'preferences' for how to change your strategy may not be the gradient of any function).

https://arxiv.org/abs/1802.05642

"The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems."