Reverse-engineering using interpretability 2021-12-29T23:21:14.328Z
Risks from AI persuasion 2021-12-24T01:48:17.231Z
Some thoughts on why adversarial training might be useful 2021-12-08T01:28:22.974Z
Considerations on interaction between AI and expected value of the future 2021-12-07T02:46:19.215Z
More detailed proposal for measuring alignment of current models 2021-11-20T00:03:39.144Z
A very crude deception eval is already passed 2021-10-29T17:57:29.475Z
Beth Barnes's Shortform 2021-09-21T12:54:50.997Z
Call for research on evaluating alignment (funding + advice available) 2021-08-31T23:28:49.121Z
Imitative Generalisation (AKA 'Learning the Prior') 2021-01-10T00:30:35.976Z
Debate update: Obfuscated arguments problem 2020-12-23T03:24:38.191Z
Looking for adversarial collaborators to test our Debate protocol 2020-08-19T03:15:26.732Z
Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z


Comment by Beth Barnes (beth-barnes) on Naturalism and AI alignment · 2021-12-28T22:29:31.334Z · LW · GW

As written there, the strong form of the orthogonality thesis states 'there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.'

I don't know whether that's intended to mean the same as 'there are no types of goals that are more 'natural' or that are easier to build agents that pursue, or that you're more likely to get if you have some noisy process for creating agents'.

I feel like I haven't seen a good argument for the latter statement, and it seems intuitively wrong to me.

Comment by Beth Barnes (beth-barnes) on Considerations on interaction between AI and expected value of the future · 2021-12-15T03:28:33.902Z · LW · GW

Yeah, I'm particularly worried about the second comment/last paragraph - people not actually wanting to improve their values, or only wanting to improve them in ways we think are not actually improvements (e.g. wanting to have purer faith).

Comment by Beth Barnes (beth-barnes) on Visible Thoughts Project and Bounty Announcement · 2021-12-08T05:48:15.216Z · LW · GW

Random small note - the 'dungeon' theme is slightly... culturally off-putting? or something, for me, as someone who's never been into this kind of thing or played any of these games, and is therefore a bit confused about what exactly this involves, and has vague negative associations (I guess because dungeons sound unpleasant?). I wonder if something a bit blander, like a story, play, or AI-assistant setting, could be better?

Comment by Beth Barnes (beth-barnes) on Visible Thoughts Project and Bounty Announcement · 2021-12-08T05:44:01.821Z · LW · GW

Someone who wants to claim the bounty could just buy the dataset from one of the companies that does this sort of thing, if they're able to produce a sufficiently high-quality version, I assume? Would that be in the spirit of the bounty?

Comment by Beth Barnes (beth-barnes) on Considerations on interaction between AI and expected value of the future · 2021-12-08T04:26:39.220Z · LW · GW

Not sure what you mean by 'Hobbesian state of nature founding assumptions', although I'll admit I'm pretty sympathetic to the Hobbesian view. You mean the claim about most creatures living in a Malthusian struggle? Do you think that's not true of non-human animals, or of humans prior to the availability of birth control? Or is your claim more that there's something about humans that should be viewed as a stable trend away from Malthusianism, not an anomaly?

Comment by Beth Barnes (beth-barnes) on Considerations on interaction between AI and expected value of the future · 2021-12-08T04:20:01.617Z · LW · GW

some relevant ideas here maybe:

Comment by Beth Barnes (beth-barnes) on Considerations on interaction between AI and expected value of the future · 2021-12-08T04:15:40.179Z · LW · GW

I guess I expect there to be a reasonable amount of computation taking place, and it seems pretty plausible a lot of these computations will be structured like agents who are taking part in the Malthusian competition. I'm sufficiently uncertain about how consciousness works that I want to give some moral weight to 'any computation at all', and reasonable weight to 'a computation structured like an agent'.

I think if you have Malthusian dynamics you *do* have evolution-like dynamics.

I assume this isn't a crux, but fwiw I think it's pretty likely most vertebrates are moral patients.

Comment by Beth Barnes (beth-barnes) on Some thoughts on why adversarial training might be useful · 2021-12-08T04:10:14.239Z · LW · GW

thanks, edited :)

Comment by Beth Barnes (beth-barnes) on Considerations on interaction between AI and expected value of the future · 2021-12-08T01:30:38.116Z · LW · GW

It sounds like you're implying that you need humans around for things to be dystopic? That doesn't seem clear to me; the AIs involved in the Malthusian struggle might still be moral patients

Comment by Beth Barnes (beth-barnes) on Considerations on interaction between AI and expected value of the future · 2021-12-07T22:35:53.072Z · LW · GW

I guess I was kind of subsuming this into 'benevolent values have become more common'

Comment by Beth Barnes (beth-barnes) on Considerations on interaction between AI and expected value of the future · 2021-12-07T22:34:16.495Z · LW · GW

ah yeah, so the claim is something like 'if we think other humans have 'bad values', maybe in fact our values are the same and one of us is mistaken, and we'll get less mistaken over time'?

Comment by Beth Barnes (beth-barnes) on Considerations on interaction between AI and expected value of the future · 2021-12-07T20:43:38.942Z · LW · GW

Is this making a claim about moral realism? If so, why wouldn't it apply to a paperclip maximiser? If not, how do we distinguish between objective mistakes and value disagreements?

Comment by Beth Barnes (beth-barnes) on Visible Thoughts Project and Bounty Announcement · 2021-12-06T19:35:02.875Z · LW · GW
combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.

I'm sympathetic to the desire to keep things simple, but I actually think that getting good at scalably collecting rich human data is probably the most valuable part of the project. I'd be really excited to see Anthropic either building an excellent internal human data team, or figuring out how to work productively with one of the existing human data provider startups.

Comment by Beth Barnes (beth-barnes) on Visible Thoughts Project and Bounty Announcement · 2021-12-06T19:26:52.284Z · LW · GW

I am very excited about finding scalable ways to collect large volumes of high-quality data on weird, specific tasks. This seems very robustly useful for alignment, and not something we're currently that good at. I'm a bit less convinced that this task itself is particularly useful.

Have you reached out to e.g. or another one of the companies that does human-data-generation-as-a-service?

Comment by Beth Barnes (beth-barnes) on Call for research on evaluating alignment (funding + advice available) · 2021-11-22T22:38:44.423Z · LW · GW

that definitely seems like a useful thing to measure! I looked into an example here:

Comment by Beth Barnes (beth-barnes) on A very crude deception eval is already passed · 2021-10-30T05:27:25.900Z · LW · GW

Instruction-following davinci model. No additional prompt material

Comment by Beth Barnes (beth-barnes) on Zoe Curzi's Experience with Leverage Research · 2021-10-13T17:39:35.948Z · LW · GW

Many of these things seem broadly congruent with my experiences at Pareto, although significantly more extreme. Especially: ideas about psychology being arbitrarily changeable; Leverage having the most powerful psychology/self-improvement tools; Leverage being approximately the only place you could make real progress; extreme focus on introspection and other techniques to 'resolve issues in your psyche' (one participant's 'research project' involved introspecting for two months about how they changed their mind); general weird dynamics (e.g. instructors sleeping with fellows, or Geoff giving lectures or meeting individually with participants in a way that felt very loaded with attempts to persuade and rhetorical tricks); and paranoia (for example, participants being concerned that the things they said during charting/debugging would be used to blackmail or manipulate them, or suspecting that the private Slack channels for each participant involved discussion of how useful the participants were in various ways and how to 'make use of them' in future). On the other hand, I didn't see any of the demons/objects/occult stuff, although people were excited about 'energy healers'/'body work' - not actually believing that there was any 'energy' involved, but thinking that something interesting in the realm of psychology/sociology was going on there. Also, I benefitted from the program in many ways, many of the techniques/attitudes were very useful, and the instructors generally seemed genuinely altruistic and interested in helping fellows learn.

Comment by Beth Barnes (beth-barnes) on Call for research on evaluating alignment (funding + advice available) · 2021-10-04T23:09:08.864Z · LW · GW

Yeah, I think you need some assumptions about what the model is doing internally.

I'm hoping you can handwave over cases like 'the model might only know X&A, not X' with something like 'if the model knows X&A, that's close enough to it knowing X for our purposes - in particular, if it thought about the topic or learned a small amount, it might well realise X'.

Where 'our purposes' are something like 'might the model be able to use its knowledge of X in a plan in some way that outsmarts us if we don't know X'?

Another way to put this is that for workable cases, I'd expect the first clause to cover things: if the model knows how to simply separate Z into X&A in the above, then I'd expect suitable prompt engineering, fine-tuning... to be able to get the model to do task X.

It seems plausible to me that there are cases where you can't get the model to do X by finetuning/prompt engineering, even if the model 'knows' X enough to be able to use it in plans. Something like: the part of its cognition that's solving X isn't 'hooked up' to the part that does output, but is hooked up to the part that makes plans. In humans, this would be any 'knowledge' that can be used to help you achieve stuff but which is subconscious - your linguistic self can't report it directly (and, further, you can't train yourself to be able to report it).

Comment by Beth Barnes (beth-barnes) on Common knowledge about Leverage Research 1.0 · 2021-09-30T03:31:29.341Z · LW · GW

Wow, that is very bad. Personally I'd still trust Julia as someone to report harms from Leverage to, mostly from generally knowing her and knowing her relationship to Leverage, but I can see why you wouldn't.

Comment by Beth Barnes (beth-barnes) on Common knowledge about Leverage Research 1.0 · 2021-09-29T01:20:25.922Z · LW · GW

The basic outline is:

- There were ~20 fellows, mostly undergrad-aged, with one younger and a few older.

- Fellows stayed in the Leverage house for ~3 months in summer 2016 and did various trainings, followed by a project with mentorship to apply things learnt from the trainings.

- Training was mostly based on Leverage ideas but also included fast-forward versions of the CFAR and 80k workshops. Some of the content was taught by Leverage staff and some by CEA staff who were very 'in Leverage's orbit'.

- I think most fellows felt that it was really useful in various ways, but also weird and sketchy and maybe harmful in various other ways.

- Several fellows ended up working for Leverage afterwards; the whole thing felt like a bit of a recruiting drive.

Comment by Beth Barnes (beth-barnes) on Common knowledge about Leverage Research 1.0 · 2021-09-29T01:07:55.707Z · LW · GW

The linked page is the program's self-description.

Comment by Beth Barnes (beth-barnes) on Common knowledge about Leverage Research 1.0 · 2021-09-28T20:56:26.374Z · LW · GW
If anyone is aware of harms or abuses that have taken place involving staff at Leverage Research, please email me, in confidence, at or

I would suggest that anything in this vein should be reported to Julia Wise, as I believe she is a designated person for reporting concerns about community health, harmful behaviours, abuse, etc. She is unaffiliated with Leverage, and is a trained social worker.

Comment by Beth Barnes (beth-barnes) on Common knowledge about Leverage Research 1.0 · 2021-09-28T20:46:57.395Z · LW · GW
Using psychological techniques to experiment on one another, and on the "sociology" of the group itself, was a main purpose of the group. It was understood among members that they were signing up to be guinea pigs for experiments in introspection, altering one's belief structure, and experimental group dynamics.

The Pareto program felt like it had substantial components of this type of social/psychological experimentation, but participants were not aware of this in advance and did not give informed consent. Some (maybe most?) Pareto fellows, including me, were not even aware that Leverage was involved in any way in running the program until they arrived, and found out they were going to be staying in the Leverage house.

Comment by Beth Barnes (beth-barnes) on Beth Barnes's Shortform · 2021-09-22T19:21:14.230Z · LW · GW

You mean a fixed point of the model changing its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping a fixed base model.

Comment by Beth Barnes (beth-barnes) on Beth Barnes's Shortform · 2021-09-21T12:54:51.432Z · LW · GW

When can models report their activations?

Related to call for research on evaluating alignment

Here's an experiment I'd love to see someone run (credit to Jeff Wu for the idea, and William Saunders for feedback):

Finetune a language model to report the activation of a particular neuron in text form.

E.g., you feed the model a random sentence that ends in a full stop. Then the model should output a number from 1-10 that reflects a particular neuron's activation.

We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because it doesn't have any computation left to turn the activation into a text output. However, at lower layers it should be able to do this correctly, with some amount of finetuning.

How many layers do you have to go down before the model succeeds? How does this scale with (a) model size and (b) amount of training?

One subtlety is that finetuning might end up changing that neuron's activation. To avoid this, we could do something like:

- Run the base model on the sentence

- Train the fine-tuned model to report the activation of the neuron in the base model, given the sentence

- Note whether the activation in the finetuned model is different
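The three-step protocol above could be sketched as follows. This is a toy, pure-Python stand-in of my own: the hash-based 'base model', the 1-10 bucketing scheme, and the memorising 'reporter' are all illustrative assumptions, not part of the proposal - a real experiment would finetune an actual language model and read a real neuron.

```python
import hashlib

def base_activation(sentence: str) -> float:
    """Stand-in for reading one neuron's activation in a frozen base model."""
    h = hashlib.sha256(sentence.encode()).digest()[0]
    return h / 255 * 2 - 1  # deterministic value in [-1, 1]

def bucket(a: float, lo: float = -1.0, hi: float = 1.0, n: int = 10) -> int:
    """Map a raw activation to the 1-10 label the model is trained to report."""
    a = min(max(a, lo), hi)
    return min(n, 1 + int((a - lo) / (hi - lo) * n))

def train_reporter(sentences):
    """'Finetune' a reporter: here it just memorises the base model's labels,
    standing in for gradient descent on the text-reporting task."""
    return {s: bucket(base_activation(s)) for s in sentences}

sentences = ["The cat sat on the mat.", "Debate is hard.", "Hello world."]
reporter = train_reporter(sentences)

for s in sentences:
    # The base model is frozen, so the activation being reported cannot have
    # been changed by 'finetuning' - which is the point of the three steps.
    assert reporter[s] == bucket(base_activation(s))
    assert 1 <= reporter[s] <= 10
```

The real version replaces `train_reporter` with finetuning on (sentence, label) pairs and then checks step three: whether the finetuned model's own copy of that neuron still matches the base model's.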

Why I think this is interesting:

I often round off alignment to 'build a model that tells us everything it “knows”’. It's useful to determine what the pragmatic limits on this are. In particular, it's useful for current alignment research to be able to figure out what our models “know” or don't “know”, and this experiment is helpful for that. It gives us more information about when ‘we tried finetuning the model to tell us X but it didn’t work’ means ‘the model doesn’t know X’, versus when the model may have a neuron that fires for X but is unable to report it in text.

Comment by Beth Barnes (beth-barnes) on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-08-12T03:53:59.455Z · LW · GW

@Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)?

(I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)

Comment by Beth Barnes (beth-barnes) on Frequent arguments about alignment · 2021-06-23T01:46:49.106Z · LW · GW

You might want to reference Ajeya's post on 'Aligning Narrowly Superhuman Models' where you're discussing alignment research that can be done with current models

Comment by Beth Barnes (beth-barnes) on Frequent arguments about alignment · 2021-06-23T01:45:23.672Z · LW · GW

I think this is a really useful post, thanks for making this! I maybe have a few things I'd add but broadly I agree with everything here.

Comment by Beth Barnes (beth-barnes) on AMA: Paul Christiano, alignment researcher · 2021-05-04T18:12:08.977Z · LW · GW

"Even if actively trying to push the field forward full-time I'd be a small part of that effort"

I think conditioning on something like 'we're broadly correct about AI safety' implies 'we're right about some important things about how AI development will go that the rest of the ML community is surprisingly wrong about'. In that world we're maybe able to contribute as much as a much larger fraction of the field, due to being correct about some things that everyone else is wrong about.

I think your overall point still stands, but it does seem like you sometimes overestimate how obvious things are to the rest of the ML community

Comment by Beth Barnes (beth-barnes) on Imitative Generalisation (AKA 'Learning the Prior') · 2021-04-07T20:20:49.015Z · LW · GW

We're trying to address cases where the human isn't actually able to update on all of D and form a posterior based on that. We're trying to approximate 'what the human posterior would be if they had been able to look at all of D'. So to do that, we learn the human prior, and we learn the human likelihood, then have the ML do the computationally-intensive part of looking at all of D and updating based on everything in there.

Does that make sense?

Comment by Beth Barnes (beth-barnes) on Imitative Generalisation (AKA 'Learning the Prior') · 2021-02-17T20:27:35.801Z · LW · GW
Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?

I think the distinction isn't actually super clear, because you can usually trade off capabilities problems and safety problems. I think of it as expanding the range of questions you can get aligned answers to in a reasonable number of steps. If you're just doing IDA/debate, and you try to get your model to give you answers to questions where the model only knows the answer because of updating on a big dataset, you can either keep going through the big dataset when any question of this type comes up (very slow, so capability limitation), or not trust these answers (capability limitation), or just hope they're correct (safety problem).

Bonus question: Is the intention only to boost efficiency, or do you think that IA will fundamentally allow amplification to solve more problems? (Ie., solve more problems with non-ridiculous amounts of compute – I'd be happy to count an exponential speedup as the latter.)

The latter :)

I think the only way to get debate to be able to answer all the questions that debate+IG can answer is to include subtrees that are the size of your whole training dataset at arbitrary points in your debate tree, which I think counts as a ridiculous amount of compute

Comment by Beth Barnes (beth-barnes) on Debate Minus Factored Cognition · 2021-01-31T04:39:03.303Z · LW · GW

That is a concern, but only in the case where there's no answer that has an argument tree that bottoms out in depth<D. As long as there exists an answer that is supported by a depth<D tree, this answer will beat the answers only supported by depth>D argument trees.

So there is a case where the debaters are not incentivised to be honest; the case where the debaters know something but there's no human-understandable argument for it that bottoms out in <D steps. This is where we get the PSPACE constraint.

If we include discussion of cross-examination (which the analysis there did not include), then we can get rid of this constraint: each debater commits to an argument tree, then each debater points out the weakest node in the tree (or points out that some part of the tree doesn't bottom out).

(we can only handle really large trees if we assume debaters are computationally unbounded in general though. If we don't assume this, even if we still assume they have oracles for some specific problems, we still probably can't supervise anything that's not in NP, because of the obfuscated argument problem)
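The depth-bound argument above can be made concrete with a toy formalisation (my own sketch, not code from the debate work - the tree representation and function names are hypothetical): an answer supported by an argument tree that bottoms out within depth D beats any answer whose only support needs a deeper tree.

```python
# An argument tree is (claim, [child_trees]); a leaf is a claim the judge
# can check directly.

def depth(tree):
    claim, children = tree
    return 1 if not children else 1 + max(depth(c) for c in children)

def judge(answer_trees, D):
    """answer_trees: answer -> supporting tree, or None if unsupported.
    Returns the winning answer, or None if nothing bottoms out within D."""
    supported = {a: t for a, t in answer_trees.items()
                 if t is not None and depth(t) <= D}
    if not supported:
        return None  # the case where honesty isn't incentivised
    # the shallowest supported answer wins
    return min(supported, key=lambda a: depth(supported[a]))

shallow = ("A", [("a1", []), ("a2", [])])          # depth 2
deep = ("B", [("b1", [("b2", [("b3", [])])])])     # depth 4

assert judge({"A": shallow, "B": deep}, D=3) == "A"
assert judge({"A": shallow, "B": deep}, D=1) is None
```

The `None` branch is exactly the problematic case described above: the debaters know something, but no human-understandable argument for it bottoms out within D steps.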

Comment by Beth Barnes (beth-barnes) on Debate Minus Factored Cognition · 2021-01-31T03:58:40.178Z · LW · GW

I don't think 'assuming one player is honest' and 'not trusting answers by default' are in contradiction. If the judge assumes one player is honest, then when they see two different answers they don't know which one to trust, but if they only see one answer (the debaters agree on an answer / the answer is not challenged by the opposing debater) then they can trust that answer.

Comment by Beth Barnes (beth-barnes) on AI safety via market making · 2021-01-31T03:54:15.673Z · LW · GW

I was trying to describe something that's the same as the judging procedure in that doc! I might have made a mistake, but I'm pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I'm happy to try to clarify, if there were particular aspects that seem different to you.

Yeah, I think the infinite tree case should work just the same - ie an answer that's only supported by an infinite tree will behave like an answer that's not supported (it will lose to an answer with a finite tree and draw with an answer with no support)

It seems possible that the proposal you're discussing very significantly addresses concerns I've had about debate.

That's exciting!

Comment by Beth Barnes (beth-barnes) on Debate update: Obfuscated arguments problem · 2021-01-13T01:35:47.827Z · LW · GW
In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?

Not systematically; I would be excited about people doing these experiments. One tricky thing is that this might be a strategy that's possible for ML models but that humans aren't naturally very good at.

If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a set of arguments for a false claim that is flawed such that an equally competent expert can't identify the flawed component, and the set of arguments doesn't immediately look suspect". This seems surprising, and I'm wondering whether it's unique to physics. (The cryptographic example was of this kind, but there, the structure of the dishonest arguments was suspect.)

Yeah, this is a great summary. One thing I would clarify is that it's sufficient that the set of arguments don't look suspicious to the judge. The arguments might look suspicious to the expert, but unless they have a way to explain to the judge why it's suspicious, we still have a problem.

If this finding holds, my immediate reaction is "okay, in this case, the solution for the honest debater is to start a debate about whether the set of arguments from the dishonest debater has this character". I'm not sure how good this sounds. I think my main issue here is that I don't know enough physics to understand why the dishonest arguments are hard to identify.

Yeah, I think that is the obvious next step. The concern is that the reasons the argument is suspicious may be hard to justify in a debate, especially if they're reasons of the form 'look, I've done a bunch of physics problems, and approaching it this way feels like it will make things messy, whereas approaching it that way feels cleaner'. Debate probably doesn't work very well for supervising knowledge that's gained through finding patterns in data, as opposed to knowledge that's gained through step-by-step reasoning. Something like imitative generalisation (AKA 'learning the prior') is trying to fill this gap.

Comment by Beth Barnes (beth-barnes) on Debate update: Obfuscated arguments problem · 2021-01-13T01:25:09.389Z · LW · GW

When you say 'this approach', what are you referring to?

Comment by Beth Barnes (beth-barnes) on Imitative Generalisation (AKA 'Learning the Prior') · 2021-01-11T20:42:37.818Z · LW · GW
It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements - this is the main reason that we need science.

Agree that humans are not necessarily great at assigning priors. The main response to this is that we don't have a way to get better priors than an amplified human's best prior. If amplified humans think the NN prior is better than their prior, they can always just use this prior. So in theory this should be both strictly better than the alternative, and the best possible prior we can use.

Science seems like it's about collecting more data and measuring the likelihood, not changing the prior. We still need to use our prior - there are infinite scientific theories that fit the data, but we prefer ones that are simple and elegant.

z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can't be calculated recursively, because there may be arbitrarily complicated interactions between different components of z.

One thing that helps a bit here is that we can use an amplified human. We also don't need the human to calculate the prior directly, just to do things like assess whether some change makes the prior better or worse. But I'm not sure how much of a roadblock this is in practice, or what Paul thinks about this problem.

Consider the following proposal: "train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy". Is there an important difference between this and imitative generalisation?

Yeah, the important difference is that in this case there's nothing that constrains the explanations to be the same as the actual reasoning the oracle is using, so the explanations you're getting are not necessarily predictive of the kind of generalisation that will happen. In IG it's important that the quality of z is measured by having humans use it to make predictions.

An agent can "generalise badly" because it's not very robust, or because it's actively pursuing goals that are misaligned with those of humans. It doesn't seem like this proposal distinguishes between these types of failures. Is this distinction important in motivating the proposal?

I'm not sure exactly what you're asking. I think the proposal is motivated by something like: having the task be IID/being able to check arbitrary outputs from our model to make sure it's generalising correctly buys us a lot of safety properties. If we have this guarantee, we only have to worry about rare or probabilistic defection, not that the model might be giving us misleading answers for every question we can't check.

Comment by Beth Barnes (beth-barnes) on Debate Minus Factored Cognition · 2021-01-06T07:07:34.484Z · LW · GW

Thanks for the post, I'm excited that you're thinking about debate!

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing). 
Basically, it sounds like you're saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we're providing, then we're going to need to run an extremely large number of debates to ever get a good answer (ie an exp number of debates for a question where the explanation for the answer is exp-sized)

It sounds like you're saying that we can avoid requiring the judge to assume one player is honest, i.e. to trust the claims lower in the debate tree when evaluating the claims higher in the tree. But if we can't assume this, that presumably means that some reasonable fraction of all claims being made are dishonest (because if there were only a few dishonest claims, then they'd have honest defeaters and we'd have a clear training signal away from dishonesty, so after training for a bit we'd be able to trust the lower claims). This probably means that most debates will give us a bad answer (as you only need a few bad claims to invalidate the whole tree). At this point, debate isn't really competitive, because it gives us dud answers almost all the time, and we're going to have to run an exponential number of debates before we happen on a correct one.

Are you suggesting we use debate more as a check on our AI systems, to help us discover that they're bad, rather than as a safe alternative? Ie debate never produces good answers, it just lets you see that bad answers are bad?

But also, the 'amplified judge consulting sub-debates' sounds like it's just the same thing as letting the judge assume that claims lower in the debate are correct when evaluating claims higher in the tree. 

Comment by Beth Barnes (beth-barnes) on Debate Minus Factored Cognition · 2021-01-06T06:53:07.313Z · LW · GW

The standard argument against having a non-zero-sum debate game is that then you may incentivise your debaters to collude.  

I don't know if you've seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior - seems somewhat relevant to what you're thinking about here. 

Comment by Beth Barnes (beth-barnes) on Debate update: Obfuscated arguments problem · 2021-01-06T06:39:50.109Z · LW · GW

To be clear, I think this is a good suggestion and is close to how I imagine we'd actually run debate in practice. It just doesn't get us beyond MA if the debaters only write P-size arguments.

Comment by Beth Barnes (beth-barnes) on Debate update: Obfuscated arguments problem · 2021-01-06T06:36:31.062Z · LW · GW

I'd be interested to hear more detail of your thoughts on how we might use robustness techniques!

Comment by Beth Barnes (beth-barnes) on Debate update: Obfuscated arguments problem · 2020-12-27T23:31:10.487Z · LW · GW

Yep, planning to put up a post about that soon. The short argument is something like:

- The equivalent of an obfuscated argument for IDA is a decomposition that includes questions the model doesn't know how to answer.

- We can't always tell the difference between an IDA tree that uses an obfuscated decomposition and gets the wrong answer, vs an IDA tree that uses a good decomposition and gets the right answer, without unpacking the entire tree.

Comment by Beth Barnes (beth-barnes) on Debate update: Obfuscated arguments problem · 2020-12-24T03:33:34.630Z · LW · GW

I just mean that this method takes time on the order of the length of the argument in judge-understandable language. So if the argument is large, you're going to need to let the debate run for a long time. This is as opposed to the previous hope that even if the argument tree is exp-sized, the debate can run in p-time.

Comment by Beth Barnes (beth-barnes) on Debate update: Obfuscated arguments problem · 2020-12-23T23:30:07.948Z · LW · GW


Yep, this does work, but limits us to questions where the argument in judge-understandable language is short enough that the debaters can write the whole thing down. So if the debaters run in P-time at deployment time, this gives us MA, not PSPACE as originally hoped. 

Comment by Beth Barnes (beth-barnes) on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-19T00:33:21.454Z · LW · GW

One counterexample is the Manhattan Project - they developed two different designs simultaneously because they weren't sure which would work better. From Wikipedia: "Two types of atomic bombs were developed concurrently during the war: a relatively simple gun-type fission weapon and a more complex implosion-type nuclear weapon."

Comment by Beth Barnes (beth-barnes) on AI safety via market making · 2020-11-22T06:04:54.031Z · LW · GW

Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that's supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument.

Comment by Beth Barnes (beth-barnes) on Learning Normativity: A Research Agenda · 2020-11-18T19:31:29.306Z · LW · GW

I see myself as trying to construct a theory of normativity which gets that "by construction", IE, we can't expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account.

Nice, this is what I was trying to say but was struggling to phrase it. I like this.

I guess I usually think of HCH as having this property, as long as the thinking time for each human is long enough, the tree is deep enough, and we're correct about the hope that natural language is sufficiently universal. It's quite likely I'm either confused or being sloppy though.

You could put 'learning the prior' inside HCH, I think - it would just be inefficient. For every claim, you'd ask your HCH tree how much you should believe it, and HCH would think about the correct way to do Bayesian reasoning, what the prior on that claim should be, and how well it predicted every piece of data you'd seen so far, in conjunction with everything else in your prior. I think one view of learning the prior is just making this process more tractable/practical, and saving you from having to revisit all your data points every time you ask any question - you just do all the learning from data once, then use the result of that to answer any subsequent questions.
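A minimal sketch of the amortisation described above (my own illustration, not from the thread): instead of re-weighing every past data point each time a new claim is queried, do the Bayesian update over hypotheses once, then answer all later queries from the cached posterior.

```python
# Do the "learning from data" once: a single pass over the data updates
# the prior over hypotheses into a posterior.

def update_posterior(prior, likelihood, data):
    """One pass over all observed data; run once, not per question."""
    posterior = dict(prior)
    for x in data:
        for h in posterior:
            posterior[h] *= likelihood(h, x)
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

def credence(posterior, claim_holds):
    """Answer a query from the cached posterior, without revisiting data."""
    return sum(p for h, p in posterior.items() if claim_holds(h))

# Example: two hypotheses about a coin, three observed heads.
prior = {"fair": 0.5, "biased": 0.5}

def likelihood(h, x):
    p_heads = 0.5 if h == "fair" else 0.9
    return p_heads if x == "H" else 1 - p_heads

post = update_posterior(prior, likelihood, ["H", "H", "H"])
print(credence(post, lambda h: h == "biased"))
```

Every subsequent question reuses `post`; the expensive pass over the data points happens once, which is the tractability gain over asking the full HCH tree to redo the reasoning per claim.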

Comment by Beth Barnes (beth-barnes) on Learning Normativity: A Research Agenda · 2020-11-18T08:08:30.152Z · LW · GW

However, that only works if we have the right prior. We could try to learn the prior from humans, which gets us 99% of the way there... but as I've mentioned earlier, human imitation does not get us all the way. Humans don't perfectly endorse their own reactions.

Note that Learning the Prior uses an amplified human (i.e., a human with access to a model trained via IDA/Debate/RRM). So we can do a bit better than a base human - e.g. we could do something like having an HCH tree where many humans generate possible feedback and other humans look at the feedback and decide how much they endorse it.
I think the target is not to get normativity 'correct', but to design a mechanism such that we can't expect to find any mechanism that does better.

Comment by Beth Barnes (beth-barnes) on Extortion beats brinksmanship, but the audience matters · 2020-11-18T06:22:30.321Z · LW · GW

FYI/nit: at first glance I thought extorsion was supposed to mean something different from extortion (I've never seen it spelt with the s) and this was a little confusing. 

Comment by Beth Barnes (beth-barnes) on AI safety via market making · 2020-11-18T06:18:26.697Z · LW · GW

Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it. 
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose - the honest debater will pay to recurse until they get to a winning node. 
If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn't pay to recurse, the judge will just see these two alternative answers and won't trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose.
Does that make sense?
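A toy model of the rule described above (my construction, purely illustrative; the claim names and `AGREED_FACTS` set are made up): follow each claim's stated support and check whether it bottoms out in something both debaters agree on, or loops back on itself - circular or unsupported claims are ignored.

```python
# Claims both debaters agree on; these are the "winning nodes" where an
# honest argument chain can terminate.
AGREED_FACTS = {"axiom"}

def bottoms_out(claim, support, seen=None):
    """True iff the support chain for `claim` terminates in an agreed fact
    rather than cycling back on itself or dangling unsupported."""
    seen = set() if seen is None else seen
    if claim in AGREED_FACTS:
        return True
    if claim in seen or claim not in support:
        return False  # circular argument, or no argument offered at all
    seen.add(claim)
    return all(bottoms_out(c, support, seen) for c in support[claim])

grounded = {"A": ["B"], "B": ["axiom"]}  # honest chain: A <- B <- axiom
circular = {"A": ["B"], "B": ["A"]}      # A and B only support each other
print(bottoms_out("A", grounded), bottoms_out("A", circular))
```

In debate terms: an honest debater can keep paying to recurse down `grounded` until the judge sees `axiom`, while recursing on `circular` never reaches a winning node, so the dishonest debater either declines to recurse (and isn't trusted) or recurses and loses.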