Testing sounds great, though I'd note that the way I'd approach testing is to first construct a general test bed that produces problematic behavior through training incentives for o1-like models. (Possibly building on DeepSeek's version of o1.) Then I'd move to trying a bunch of stuff in this setting.
(I assume you agree, but thought this would be worth emphasizing to third parties.)
Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You could still have a paraphraser too.
A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there's more of a chance that there will be 'skill transfer' or 'skill leakage' so to speak. What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don't cause a human reading them to think 'this is manipulative and/or deceptive,' and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.
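To make the intended division of labor concrete, here is a minimal sketch of the generation pipeline (not anyone's actual implementation; `shoggoth`, `face`, `paraphraser`, and the `generate` interface are all hypothetical placeholders). The key property is that anything reward-like is computed only from the Face's visible output, never from the hidden reasoning:

```python
# Hypothetical sketch of a shoggoth/face pipeline; all names and the
# `generate` interface are placeholders, not a real API.

def respond(prompt, shoggoth, face, paraphraser=None):
    # The Shoggoth writes the internal chain of thought. It is never
    # optimized against signals derived from how the final answer "looks".
    cot = shoggoth.generate(f"{prompt}\n<reasoning>")

    # Optional paraphraser, intended to strip steganographic detail
    # before the reasoning is passed along.
    if paraphraser is not None:
        cot = paraphraser.generate(f"Paraphrase this reasoning:\n{cot}")

    # The Face turns the (possibly paraphrased) reasoning into the
    # user-visible answer. Only this answer is shown to the reward /
    # preference process, so presentation skills (euphemism, censorship,
    # etc.) should accrue to the Face's weights alone.
    answer = face.generate(f"{prompt}\n<reasoning>{cot}</reasoning>\n<answer>")
    return cot, answer
```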
Thanks for this description. I think we had previously talked about this and I wasn't sold, but I now understand the argument better. I think I roughly buy this argument, though it only seems moderately strong.
One remaining objection I have is that very often you want to do revision (editing, review, etc.) on the actual output. If the actual output needs to differ substantially from what the underlying policy (shoggoth) generates because of incentivized censorship/manipulation, then that will cause issues, and thus you will actually still incentivize the policy rather than the Face to do this work.
A further issue I've realized on additional examination is that I don't see a reason to expect that the models will specialize in this way with the training scheme you've proposed. In particular, the policy can just learn to do what the Face does, and by default we should expect that the responsibility gets split in some arbitrary way, subject to the constraint that the Face doesn't output anything that performs poorly. But the underlying policy totally could do the work of the Face! I expect that what you get in practice will be somewhat sensitive to the initialization and random other factors.
Overall, I would probably lean against splitting into a Face model like this unless we demonstrate this working better in a test bed, given the additional complexity and cost.
I was asking about HSB not because I think it is similar to the process that produces AIs, but because if the answer differs, then it implies you're making some narrower assumption about the inductive biases of AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that AIs trained to be a TD-agent might generalize for the same reason that an AI trained on a paperclip maximization objective might generalize to maximize paperclips in some very different circumstance. But I don't necessarily buy that this is how the paperclip-maximization-trained AI will generalize!
(I'm picking up this thread from 7 months ago, so I might be forgetting some important details.)
I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn't make very much sense to me as a safety method intended for deployment.
Sadly, gathering evidence of misalignment in deployment seems likely to me to be one of the most effective strategies for gathering legible evidence (at least for early systems) given likely constraints. (E.g., because people won't believe results in test beds and because RL might be too expensive to run twice.)
Sure, I mean that I expect that for the critical regime for TAI, we can literally have the Face be 3.5 Sonnet.
Or get some other work out of these systems such that you greatly reduce risk going forward.
For reference, @habryka has now posted them here.
I agree there are various ways the Face could cause problems via influencing the training of the actual policy. (Both wilder influence that would require clever manipulation, and a more mundane reduction of the disincentive for problematic reasoning (though I'd note this mundane issue is the core goal of the original proposal!).)
I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the Face to be quite dumb (e.g. literally 3.5 Sonnet might suffice).
It might be easier to try to establish some track record by doing a small research project first. I don't know if you have enough runway for this though.
To the extent that complexity makes successfully red-teaming our approaches too difficult, we could try to mostly depend on simple approaches and only use more complex approaches if we are confident that at least they can't make the situation worse.
My general view is that complex methods are only productive if you have a good feedback loop (either empirical feedback or very robust theoretical feedback (e.g. proofs)). In the case of control, so long as we do a good job with red teaming[1], we can build a good feedback loop such that complex methods can be productive. I think a similar process is often used in computer security, and I think computer security often involves complex measures[2].
As an aside, this also makes me relatively pessimistic about developing complex and unprincipled alignment (non-control) approaches for hard-to-experiment-on threat models like scheming (aka deceptive alignment). It will be difficult to make scheming model organisms which realistically reproduce the internal structure and capability level of models we're worried about and thus it will be hard to effectively iterate on complex and unprincipled measures.
An alternative approach for empirics is to focus on highly principled measures where we have a story for why they should generalize to scheming that we can test and iterate against (e.g. we'll understand how the model works which we can test by seeing if we can understand models in general).
Which isn't clearly possible given that the relevant AIs could have very different capability profiles from humans. I think this will be fine in practice, but this is non-obvious. ↩︎
I don't think the ideas we consider are more complex than typical approaches in security, though Buck would know better than me. ↩︎
On the other hand, your version of Daniel's protocol loses this property.
I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.
And, these should clearly be separate models such that the training of one doesn't generalize to the other.
So, making the Face more deceptive doesn't kill the canary? The question is whether we end up with a highly misaligned policy (and whether this is clear from the chain of thought, such that we can discover it). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).
Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?
(It might also be possible to reduce the extent to which the initial Face has control over the world. E.g., you can initialize one Face to use for the first part of training (including the proposal I was mentioning) and then train another later on to imitate the earlier Face on a subpart of the distribution you inspect with the hope that it doesn't inherit learned misalignment. You could also deploy a different Face than the one you use for training if you aren't doing online training.)
By "control", I mean AI Control: approaches aiming to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures.
AI control stops working once AIs are sufficiently capable (and likely doesn't work for all possible deployments that might eventually be otherwise desirable), but there could be other approaches that work at that point. In particular, aligning systems.
The main hope I think about is something like:
- Use control until AIs are capable enough that if we trusted them, we could obsolete top human scientists and experts.
- Use our controlled AI labor to do the work needed to make systems which are capable enough, trustworthy enough (via alignment), and philosophically competent enough that we can safely hand things off to them. (There might be some intermediate states to get to here.)
- Have these systems which totally obsolete us figure out what to do, including figuring out how to align more powerful systems as needed.
We discuss our hopes more in this post.
You can still inspect the underlying CoT to know if you have big issues.
It is worth noting that the baseline proposal was already incentivizing the Face to be deceptive to the extent that this proposal is important (the whole hope was to allow for hiding deception). So, I agree that my proposed change trades off "less deceptive CoT" for "more deceptive Face" relative to the baseline proposal, but it feels like this is mostly a quantitative difference?
(My guess is that you have this concern with the baseline protocol and thus don't like moving further in this direction.)
It could be that the baseline proposal doesn't make sense because shaping with process based feedback on the CoT is so useful that it is worth taking the risk of incentivizing unfaithful CoT.
make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)
I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)
OpenAI models are seemingly trained on huge amounts of chess data, perhaps 1-4% of documents are chess (though chess documents are short, so the fraction of tokens which are chess is smaller than this).
We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.
If it was enough evidence that I was strongly convinced, sure. But IDK if I would be convinced, because the evidence might actually be unclear.
I agree you'll be able to get some work out, but you might be taking a big productivity hit.
Also, TBC, I'm not generally that worried about generic sandbagging on safety research relative to other problems.
I'm not so sure about this being easy to catch. Suppose we're actually uncertain (as I am now) about whether AIs will be differentially much less useful for safety due to difficulties with checking their work and AIs generally being worse at philosophy. Then, we see that AIs really suck at coming up with good ideas in safety and at pursuing the right sort of experiments.
What should we conclude? What will we be able to legibly argue?
I agree we'll know we aren't getting out much value in this case, but that doesn't help that much...
(I agree about countermeasures being pretty doable as I noted in the parent.)
Why not think that sandbagging could reduce productivity by ~100%? I think it can do this in human research organizations.
I happen to think this is pretty doable to address with countermeasures, but certainly without countermeasures this could be quite brutal.
As far as safety research sabotage, I'm often imagining careful sabotage rather than generic sandbagging.
As in, the AI intentionally inserts very subtle errors at adversarially determined times. It might need to maintain plausible deniability in case it is caught a small number of times, but you should generally imagine an error pattern that is quite different from just making more mistakes or being dumber.
I can imagine that this results in your automated safety research being worthless or of negative value.
TBC, it seems reasonably likely to me that in the relevant regime, basic measures like having humans spend a long time trying to understand an output suffice for catching bad behavior and/or training out bad behavior.
You can use the first X% of scenarios at deployment time and audit them with humans.
Only if you don't have privacy guarantees?
Also, for the honeypots, you can potentially generate vast numbers of honeypots, use a trusted model to review outputs, and then only have humans look at the cases where the trusted model thinks the output is suspicious.
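A rough sketch of that triage pipeline; all interfaces (`untrusted_model.run`, `trusted_model.suspicion_score`, the human review queue, and the threshold) are assumed placeholders rather than anything real:

```python
# Sketch of honeypot triage: run the untrusted model on many honeypot
# scenarios, score each transcript with a trusted monitor, and only
# escalate the most suspicious cases for human review. All interfaces
# and the threshold are hypothetical.

SUSPICION_THRESHOLD = 0.8  # assumed tuning parameter

def triage_honeypots(honeypots, untrusted_model, trusted_model, human_queue):
    flagged = []
    for scenario in honeypots:
        transcript = untrusted_model.run(scenario)
        score = trusted_model.suspicion_score(scenario, transcript)
        if score >= SUSPICION_THRESHOLD:
            human_queue.submit(scenario, transcript, score)
            flagged.append((scenario, score))
    return flagged
```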
Your control protocols should be able to prevent the best adversary you can simulate from breaking them.
Surely this will be unnecessarily conservative a bunch of the time?
In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn't done in practice since it is expensive and slow.
Or it is overly conservative?
No IMO.
(I'm also skeptical of competitiveness with expert jailbreakers, but this isn't a crux. Edit: It does seem to be competitive with Scale's jailbreakers, but I think some human jailbreakers are much better and I expect the gap is more stark for harder to jailbreak models.)
All of these operationalizations are about exact notions from the training setup.
Another important notion is revealed identity:
- Does the AI think of the other AI as "itself" to the extent that it thinks about stuff like this at all?
- Do the AIs cooperate in a way which is reasonably similar to cooperating with yourself?
Hmm, I really just mean that "labor" is probably the most important input to the current production function. I don't want to make a claim that there aren't better ways of doing things.
I agree that suboptimal recruiting/hiring also causes issues, but it isn't easy to solve this problem with money.
Hiring being highly selective does not imply things aren't constrained on people.
Getting 10x as many people as good as the top 20 best safety researchers would make a huge difference.
Joe's argument here would actually be locally valid if we changed:
a sufficient number of IQ 100 agents with sufficient time can do anything that an IQ 101 agent can do
to:
a sufficient number of IQ 100 agents with sufficient time can do anything that some number of IQ 101 agents can do eventually
We can see why this works when applied to your analogy. If we change:
A sufficient number of 4yo’s could pick up any weight that a 5yo could pick up
to
A sufficient number of 4yo’s could pick up any weight that some number of 5yo's could pick up
Then we can see where the issue comes in. The problem is that while a team of 4yo's can always beat a single 5yo, there exists some number of 5yo's which can beat any number of 4yo's.
If we fix the local validity issue in Joe's argument like this, it is easier to see where issues might crop up.
I do think this was reasonably though not totally predictable ex-ante, but I agree.
it turns out that enforcing sparsity (and doing an SAE) gives better interp scores than doing PCA
It's not clear to me this is true exactly. As in, suppose I want to explain as much of what a transformer is doing as possible with some amount of time. Would I be better off looking at PCA features vs SAE features?
Yes, most/many SAE features are easier to understand than PCA features, but each SAE feature (which is actually sparse) is only a tiny, tiny fraction of what the model is doing. So, it might be that you'd get better interp scores (in terms of how much of what the model is doing gets explained) with PCA.
Certainly, if we do literal "fraction of loss explained by human written explanations" both PCA and SAE recover approximately 0% of training compute.
I do think you can often learn very specific more interesting things with SAEs and for various applications SAEs are more useful, but in terms of some broader understanding, I don't think SAEs clearly are "better" than PCA. (There are also various cases where PCA on some particular distribution is totally the right tool for the job.)
Certainly, I don't think it has been shown that we can get non-negligible interp scores with SAEs.
To be clear, I do think we learn something from the fact that SAE features seem to often/mostly at least roughly correspond to some human concept, but I think the fact that there are vastly more SAE features vs PCA features does matter! (PCA was never trying to decompose into this many parts.)
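For concreteness, here is a bare-bones sketch of the two decompositions being compared, applied to a matrix of residual-stream activations; the architecture and hyperparameters are made up for illustration, not taken from any particular SAE paper:

```python
# Rough sketch contrasting the two decompositions on residual-stream
# activations `acts` of shape (n_samples, d_model). Hyperparameters and
# training details are illustrative only.
import torch

def pca_features(acts, k):
    # PCA: at most d_model dense directions, each explaining a large
    # chunk of variance.
    centered = acts - acts.mean(dim=0)
    _, _, v = torch.pca_lowrank(centered, q=k)
    return centered @ v  # (n_samples, k) dense codes

class SparseAutoencoder(torch.nn.Module):
    # SAE: many more directions than d_model (e.g. 16x), each active on
    # only a tiny fraction of inputs, so each explains only a sliver of
    # what the model is doing.
    def __init__(self, d_model, expansion=16, l1_coeff=1e-3):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, expansion * d_model)
        self.dec = torch.nn.Linear(expansion * d_model, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts):
        codes = torch.relu(self.enc(acts))
        recon = self.dec(codes)
        loss = ((recon - acts) ** 2).mean() + self.l1_coeff * codes.abs().mean()
        return codes, recon, loss
```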
Somewhat off-topic, but isn't this a non-example:
We have strong evidence that interesting semantic features exist in superposition
I think a more accurate statement would be "We have strong evidence that neurons don't do a single 'thing' (either in the human ontology or in any other natural ontology)" combined with "We have strong evidence that the residual stream represents more 'things' than it has dimensions".
Aren't both of these what people would (and did) predict without needing to look at models at all?[1] As in, both of these are the null hypothesis in a certain sense. It would be kinda specific if neurons did do a single thing, and we can rule out 1 "thing" per dimension in the residual stream just by noting that Transformers work at all.
I think there are more detailed models of a more specific thing called "superposition" within toy models, but I don't think we have strong evidence of any very specific claim about larger AIs.
(SAE research has shown that SAEs often find directions which seem to at least roughly correspond to concepts that humans have and which can be useful for some methods, but I don't think we can make a much stronger claim at this time.)
In fact, I think that mech interp research was where the hypothesis "maybe neurons represent a single thing and we can understand neurons quite well (mostly) in isolation" was raised. And this hypothesis seems to look worse than a more default guess about NNs being hard to understand and there not being an easy way to decompose them into parts for analysis. ↩︎
It depends on the type of transparency.
Requirements which just involve informing one part of the government (say US AISI) in ways which don't cost much personnel time mostly have the effect of potentially making the government much more aware at some point. I think this is probably good, but presumably labs would prefer to retain flexibility and making the government more aware can go wrong (from the lab's perspective) in ways other than safety focused regulation. (E.g., causing labs to be merged as part of a national program to advance capabilities more quickly.)
Whistleblower requirements with teeth can have information leakage concerns. (Internal-only whistleblowing policies like Anthropic's don't have this issue, but also have basically no teeth for the company overall.)
Any sort of public discussion (about e.g. threat models, government involvement, risks) can have various PR and reputational costs. (Idk if you were counting this under transparency.)
To the extent you expect to disagree with third party inspectors about safety (and they aren't totally beholden to you), this might end up causing issues for you later.
I'm not claiming that "reasonable" labs shouldn't do various types of transparency unilaterally, but I don't think the main cost is in making safety-focused regulation more likely.
Oh huh, parse error on me.
Something between training the whole model with RL and BoN is training just the last few layers of the model (for current architectures) with RL and then doing BoN on top as needed to increase performance. This means most of the model won't know the information (except insofar as the info shows up in outputs) and allows you to get some of the runtime cost reductions of using RL rather than BoN.
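A sketch of what this could look like: freeze everything except the last few blocks before RL, then wrap generation in best-of-N at inference time. The way you access the transformer blocks, and the reward/generation interfaces, depend on the specific model code, so treat these as placeholders:

```python
# Sketch: RL on only the last few transformer blocks, plus best-of-N on
# top at inference. Block access and the reward/generation interfaces
# are placeholders that depend on the model library.

def freeze_all_but_last_layers(model, blocks, n_trainable=2):
    for param in model.parameters():
        param.requires_grad = False
    for block in blocks[-n_trainable:]:
        for param in block.parameters():
            param.requires_grad = True
    # Pass only the unfrozen parameters to the RL optimizer.
    return [p for p in model.parameters() if p.requires_grad]

def best_of_n(prompt, generate, reward_model, n=8):
    # Sample n candidates and keep the one the reward model scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model.score(prompt, c))
```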
I'm just talking about research intended to be safety/safety-adjacent. As in, of this research, what has the quality-weighted differential safety progress been?
Probably the word "differential" was just a mistake.
Low importance aside:
Seems slightly unproductive and unnecessarily connotation-inducing to say "Anthropic says:". Maybe "The Sabotage eval paper (from Anthropic) says:". Like, this was just written by various people on the Anthropic alignment science team, and by no means should discussion sections of papers be interpreted to speak for the company overall. Obviously not very important.
Counting the number of papers isn't going to be a good strategy.
I do think total research outside of labs looks competitive with research from labs and probably research done outside of labs has produced more differential safety progress in total.
I also think open weight models are probably good so far in terms of making AI more likely to go well (putting aside norms and precedents of releases), though I don't think this is necessarily implied from "more research happens outside of labs".
We discuss similar issues in meta-level adversarial evaluations of oversight.
The simplest approach is just to have a human in the loop monitoring every action the AI does, in combination with targeting a proxy failure rather than a real failure. This can probably reduce risk to the point where it is a small fraction of overall risk.
Human monitoring might be too expensive, in which case we'll have to go with sandboxing and pure proxy.
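A sketch of that baseline, assuming hypothetical agent/environment/human interfaces and a harmless proxy failure check:

```python
# Sketch of "human approves every action" monitoring combined with a
# proxy failure target. All interfaces are hypothetical placeholders;
# the proxy failure is harmless by construction (e.g. touching a canary
# resource) and is treated as evidence about what the AI would do in an
# unmonitored deployment.

def run_with_human_gate(agent, environment, human, proxy_failure_check):
    state = environment.reset()
    while not environment.done():
        action = agent.act(state)
        if not human.approves(state, action):
            environment.log_refusal(state, action)
            break
        state = environment.step(action)
        if proxy_failure_check(state):
            environment.flag_incident(state)
            break
    return environment.transcript()
```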
This paper seems like a great first stab at starting to answer very important questions about scalable oversight sample efficiency (at least for non-scheming models). Better understanding of sample efficiency seems like it should have a big effect on research prioritization for scalable oversight (and W2SG).
Despite this being a great start, I don't feel like I yet know the answers to these questions. Future work would need to explore methods somewhat more[1], do cleaner scaling experiments by using a nicer model stack[2], and use more analogous datasets, particularly preference modeling scores on datasets of difficult agentic tasks. Also, it might eventually be a good idea to run experiments with human labelers in case there are important differences between weak human labels and weak LLM labels.
I think there are a few promising seeming methods which weren't tried, but I don't have a good sense of what exploration was done outside of the methods listed in the body of the paper. ↩︎
This might need to be done at labs, because I don't think there is a great open source model stack that goes all the way to pretty powerful models. I think Llama 3 is the best, and it only has 3 models which were trained in mostly comparable ways? Pythia was pretty good on standardization, but fails to go to sufficiently powerful models. ↩︎
Gotcha, I interpreted your comment as implying you were interested in trying to improve your views on the topic in collaboration with someone else (who is also interested in improving their views on the topic).
So I thought it was relevant to point out that people should probably mostly care about a different question.
the magnitude of effects of such work is large enough that "positive sign" often is enough information to decide that it dominates many alternatives, though certainly not all of them
FWIW, my guess is that this is technically true if you mean something broad by "many alternatives", but if you mean something like "the best several alternatives that you would think of if you spent a few days thinking about it and talking to people" then I would disagree.
Isn't the most relevant question whether it is the best choice for you? (Taking into account your objectives which are (mostly?) altruistic.)
I'd guess having you work on capabilities at Anthropic is net good for the world[1], but probably isn't your best choice long run and plausibly isn't your best choice right now. (I don't have a good understanding of your alternatives.)
My current view is that working on capabilities at Anthropic is a good idea for people who are mostly altruistically motivated if and only if that person is very comparatively advantaged at doing capabilities at Anthropic relative to other similarly altruistically motivated people. (Maybe if they are in the top 20% or 10% of comparative advantage among this group of similarly motivated people.)
Because I think Anthropic being more powerful/successful is good, the experience you'd gain is good, and the influence is net positive. And these factors are larger than the negative externalities from advancing AI for other actors. ↩︎
I would say that the "internal use restrictions" are just an ad hoc control argument and I'd prefer to explicitly think about it like that. But, it is worth noting that if you are sufficiently conservative in the deployment (as Evan describes), then I think it would be fine to make a very minimal safety case.
Here is an example of what this could look like:
- The model isn't quite superhuman in any important way. Supporting claims:
- We're confident the model isn't wildly more powerful than it appears. Evidence: [you have to fill out]
- The model appears to not be superhuman in any important way. Evidence: We ran some capability evals.
- Our deployment is so conservative that as long as the model isn't quite superhuman, the model couldn't do anything even if it wanted to. Supporting claims:
- We thought about it some and it seems so conservative that it is unlikely we could be wrong even though we didn't spend that long doing threat modeling.
- We've actually implemented the relevant infrastructure (for sandboxing etc) and it is used in all the relevant places (including in RL and evaluations of the model).
- We'll enforce this level of conservativeness in a sufficiently robust way that we're confident that employee error won't be an issue. Evidence: [you have to fill out]
Note that this safety case doesn't require running any control evals and it probably requires no ML work other than evals you were probably already doing and supporting evidence for the model not being wildly more capable than it seems. It probably requires some non-trivial infra to set up and enforce this conservative of a deployment.
I think your internal usage restrictions are wildly unaffordable
Sure, but note that Evan says "Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage."
I don't think it is (clearly) wildly unaffordable to require an affirmative safety case prior to proceeding with (substantially) using the model internally? (Note that control could be part of such a case.)
I think Evan is just trying to say "using the model some (but not for XYZ) wouldn't (shouldn't?) even require making an affirmative safety case (at least for the first point at which this triggers)".
I think this can be consistent with treating models like external contractors.
(Though I'd note that "treat the model like external contractors" is maybe too weak and I'd replace this with something like "treat the model like external contractors who have a 25% chance of all being spies of a foreign adversary".)
I do think that making an affirmative safety case for a "very low" level of total lifetime takeover risk (perhaps 0.25%) is very likely to be totally unaffordable at some point (at least in short timelines and with my priors).
Our evaluations indicate that we have very poor levels of safety (e.g. the AI would probably be able to escape if it wanted to) and we can't find countermeasures which suffice to ensure any level of safety (without basically giving up on using this model).
(Huh, good to know this changed. I wasn't aware of this.)
"AI frontier labs" is a bad name. Maybe "Frontier AI companies" which I think is more standard?
(I think random (non-leadership) GDM employees generally have a lot of freedom while employees of other companies have much less in-practice freedom (except for maybe longer time OpenAI employees who I think have a lot of freedom).)
I'm pretty skeptical that Neel's MATS stream is partially supported/subsidized by GDM's desire to generally hire for capabilities. (And I certainly don't think they directly fund this.) Same for other mech interp hiring at GDM; I doubt that anyone is thinking "these mech interp employees might convert into employees for capabilities". That said, this sort of thinking might subsidize the overall alignment/safety team at GDM to some extent, but I think this would mostly be a mistake for the company.
Seems plausible that this is an explicit motivation for junior/internship hiring on the Anthropic interp team. (I don't think the Anthropic interp team has a MATS stream.)
I think the primary commercial incentive on mechanistic interpretability research is that it's the alignment research that most provides training and education to become a standard ML engineer who can then contribute to commercial objectives.
Is your claim here that a major factor in why Anthropic and GDM do mech interp is to train employees who can later be commercially useful? I'm skeptical of this.
Maybe the claim is that many people go into mech interp so they can personally skill up and later might pivot into something else (including jobs which pay well)? This seems plausible/likely to me, though it is worth noting that this is a pretty different argument with very different implications from the one in the post.
I'm sympathetic to 'a high fraction of "alignment/safety" work done at AI companies is done due to commercial incentives and has negligible effect on AI takeover risk (or at least much smaller effects than work which isn't influenced by commercial incentives)'.
I also think a decent number of ostensibly AI x-risk focused people end up being influenced by commercial incentives sometimes knowingly and directly (my work will go into prod if it is useful and this will be good for XYZ reason; my work will get more support if it is useful; it is good if the AI company I work for is more successful/powerful, so I will do work which is commercially useful) and sometimes unknowingly or indirectly (a direction gets more organizational support because of usefulness; people are misled into thinking something is more x-risk-helpful than it actually is).
(And a bunch of originally AI x-risk focused people end up working on things which they would agree aren't directly useful for mitigating x-risk, but have some more complex story.)
I also think AI companies generally are a bad epistemic environment for x-risk safety work for various reasons.
However, I'm quite skeptical that (mechanistic) interpretability research in particular gets much more funding due to it directly being a good commercial bet (as in, it is worth it because it ends up directly being commercially useful). And, my guess is that alignment/safety people at AI companies which are ostensibly focused on x-risk/AI takeover prevention are less than 1/2 funded via the direct commercial case. (I won't justify this here, but I think this is due to a combination of personal experience and thinking about what these teams tend to work on.)
(That said, I think putting some effort on (mech) interp (or something similar) might end up being a decent commercial bet via direct usage, though I'm skeptical.)
I think there are some adjacent reasons alignment/safety work might be funded/encouraged at AI companies beyond direct commercial usage:
- Alignment/safety teams or work might be subsidized via being good internal/external PR for AI companies. As in good PR to all of: the employees who care about safety, the AI-safety-adjacent community who you recruit from, and the broader public. I think probably most of this effect is on the "appeasing internal/external people who might care about safety" rather than the broader public.
- Having an internal safety team might help reduce/avoid regulatory cost for your company or the industry. Both via helping in compliance and via reducing the burden.
My guess would be that they're not saying that well-designed control evaluations become untrustworthy
It's a bit messy because we have some ability to check whether we should be able to evaluate things.
So, there are really three relevant "failure" states for well done control:
- We can't find countermeasures such that our control evaluations indicate any real safety.
- We no longer think that control evaluations work to evaluate safety as models greatly exceed our planning capabilities and/or can sandbag sufficiently well.
- We think control evaluations can work well, but we're wrong and they actually don't.
I think (1) or (2) will likely happen prior to (3) if you do a good job.
We discuss this more here.
(Really, the actual system is more complex because we can vary how conservative control evaluations are and it is easier to be confident in the results of more conservative evaluations. E.g., we can be conservative about our AIs planning ability. So, there is really a somewhat more continuous tradeoff than the above list makes it seem.)