Comments
I think I probably agree, although I feel somewhat wary about it. My main hesitations are:
- The lack of epistemic modifiers seems off to me, relative to the strength of the arguments they’re making. Such that while I agree with many claims, my imagined reader who is coming into this with zero context is like “why should I believe this?” E.g., “Without intervention, humanity will be summarily outcompeted and relegated to irrelevancy,” which like, yes, but also—on what grounds should I necessarily conclude this? They gave some argument along the lines of “intelligence is powerful,” and that seems probably true, but imo not enough to justify the claim that it will certainly lead to our irrelevancy. All of this would be fixed (according to me) if it were framed more as like “here are some reasons you might be pretty worried,” of which there are plenty, or "here's what I think," rather than “here is what will definitely happen if we continue on this path,” which feels less certain/obvious to me.
- Along the same lines, I think it’s pretty hard to tell whether this piece is in good faith or not. E.g., in the intro Connor writes “The default path we are on now is one of ruthless, sociopathic corporations racing toward building the most intelligent, powerful AIs as fast as possible to compete with one another and vie for monopolization and control of both the market and geopolitics.” Which, again, I don’t necessarily disagree with, but my imagined reader with zero context is like “what, really? sociopaths? control over geopolitics?” I.e., I’m expecting readers to question the integrity of the piece, and to be more unsure of how to update on it (e.g. "how do I know this whole thing isn't just a strawman?" etc.).
- There are many places where they kind of just state things without justifying them much. I think in the best case this might cause readers to think through whether such claims make sense (either on their own, or by reading the hyperlinked stuff—both of which put quite a lot of cognitive load on them), and in the worst case just causes readers to either bounce or kind of blindly swallow what they’re saying. E.g., “Black-Box Evaluations can only catch all relevant safety issues insofar as we have either an exhaustive list of all possible failure modes, or a mechanistic model of how concrete capabilities lead to safety risks.” They say this without argument and then move on. And although I agree with them (having spent a lot of time thinking this through myself), it’s really not obvious at first blush. Why do you need an exhaustive list? One might imagine, for instance, that a small number of tests would generalize well. And do you need mechanistic models? Sometimes medicines work safely without that, etc., etc. I haven’t read the entire Compendium closely, but my sense is that this is not an isolated incident. And I don't think this is a fatal flaw or anything—they're moving through a ton of material really fast and it's hard to give a thorough account of all claims—but it does make me more hesitant to use it as the default "here's what's happening" document.
All of that said, I do broadly agree with the set of arguments, and I think it’s a really cool activity for people to write up what they believe. I’m glad they did it. But I’m not sure how comfortable I feel about sending it to people who haven’t thought much about AI.
"It's plausible that things could go much faster than this, but as a prediction about what will actually happen, humanity as a whole probably doesn't want things to get incredibly crazy so fast, and so we're likely to see something tamer." I basically agree with that.
I feel confused about how this squares with Dario’s view that AI is "inevitable," and "driven by powerful market forces." Like, if humanity starts producing a technology which makes practically all aspects of life better, the idea is that this will just… stop? I’m sure some people will be scared of how fast it’s going, but it’s hard for me to see the case for the market in aggregate incentivizing less of a technology which fixes ~all problems and creates tremendous value. Maybe the idea, instead, is that governments will step in...? Which seems plausible to me, but as Ryan notes, Dario doesn’t say this.
Thanks, I think you’re right on both points—that the old RSP also didn’t require pre-specified evals, and that the section about Capability Reports just describes the process for non-threshold-triggering eval results—so I’ve retracted those parts of my comment; my apologies for the error. I’m on vacation right now so was trying to read quickly, but I should have checked more closely before commenting.
That said, it does seem to me like the “if/then” relationships in this RSP have been substantially weakened. The previous RSP contained enough wiggle room that I didn’t interpret it as imposing real constraints on Anthropic’s actions; but it did at least seem to me to be aiming at well-specified “if’s,” i.e., ones which depended on the results of specific evaluations. Like, the previous RSP describes their response policy as: “If an evaluation threshold triggers, we will follow the following procedure” (emphasis mine), where the trigger for autonomous risk happens if “at least 50% of the tasks are passed.”
In other words, the “if’s” in the first RSP seemed more objective to me; the current RSP strikes me as a downgrade in that respect. Now, instead of an evaluation threshold, the “if” is determined by some opaque internal process at Anthropic that the document largely doesn’t describe. I think in practice this is what was happening before—i.e., that the policy basically reduced to Anthropic crudely eyeballing the risk—but it’s still disappointing to me to see this level of subjectivity more actively codified into policy.
My impression is also that this RSP is more Orwellian than the first one, and this is part of what I was trying to gesture at. Not just that their decision process has become more ambiguous and subjective, but that the whole thing seems designed to be glossed over, such that descriptions of risks won’t really load in readers’ minds. This RSP seems much sparser on specifics, and much heavier on doublespeak—e.g., they use the phrase “unable to make the required showing” to mean “might be terribly dangerous.” It also seems to me to describe many things too vaguely to easily argue against. For example, they claim they will “explain why the tests yielded such results,” but my understanding is that this is mostly not possible yet, i.e., that it’s an open scientific question, for most such tests, why their models produce the behavior they do. But without knowing what “tests” they mean, nor the sort of explanations they’re aiming for, it’s hard to argue with; I’m suspicious this is intentional.
In the previous RSP, I had the sense that Anthropic was attempting to draw red lines—points at which, if models passed certain evaluations, Anthropic committed to pause and develop new safeguards. That is, if evaluations triggered, then they would implement safety measures. The “if” was already sketchy in the first RSP, as Anthropic was allowed to “determine whether the evaluation was overly conservative,” i.e., they were allowed to retroactively declare red lines green. Indeed, with such caveats it was difficult for me to see the RSP as much more than a declared intent to act responsibly, rather than a commitment. But the updated RSP seems to be far worse, even, than that: the “if” is no longer dependent on the outcomes of pre-specified evaluations, but on the personal judgment of Dario Amodei and Jared Kaplan.
Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining from doing so might jeopardize the company.
This seems strictly worse to me. Some room for flexibility is warranted, but this strikes me as almost maximally flexible, in that practically nothing is predefined—not evaluations, nor safeguards, nor responses to evaluations. This update makes the RSP more subjective, qualitative, and ambiguous. And if Anthropic is going to make the RSP weaker, I wish this were noted more as an apology, or along with a promise to rectify this in the future. Especially because after a year, Anthropic presumably has more information about the risk than before. Why, then, is even more flexibility needed now? What would cause Anthropic to make clear commitments?
I also find it unsettling that the ASL-3 risk threshold has been substantially changed, and the reasoning for this is not explained. In the first RSP, a model was categorized as ASL-3 if it was capable of various precursors for autonomous replication. Now, this has been downgraded to a “checkpoint,” a point at which they promise to evaluate the situation more thoroughly, but don’t commit to taking any particular actions:
We replaced our previous autonomous replication and adaption (ARA) threshold with a “checkpoint” for autonomous AI capabilities. Rather than triggering higher safety standards automatically, reaching this checkpoint will prompt additional evaluation of the model’s capabilities and accelerate our preparation of stronger safeguards.
This strikes me as a big change. The ability to self-replicate is already concerning, but the ability to perform AI R&D seems potentially catastrophic, risking loss of control or extinction. Why does Anthropic now think this shouldn’t count as ASL-3? Why have they replaced this criterion with a substantially riskier one instead?
Dario estimates the probability of something going “really quite catastrophically wrong, on the scale of human civilization” as between 10% and 25%. He also thinks this might happen soon—perhaps between 2025 and 2027. It seems obvious to me that a policy this ambiguous, this dependent on figuring things out on the fly, this beset with such egregious conflicts of interest, is a radically insufficient means of managing risk from a technology which poses so grave and imminent a threat to our world.
Basically I just agree with what James said. But I think the steelman is something like: you should expect shorter (or no) pauses with an RSP if all goes well, because the precautions are matched to the risks. Like, the labs aim to develop safety measures which keep pace with the dangers introduced by scaling, and if they succeed at that, then they never have to pause. But even if they fail, they're also expecting that building frontier models will help them solve alignment faster. I.e., either way the overall pause time would probably be shorter?
It does seem like in order to not have this complaint about the RSP, though, you need to expect that it's shorter by a lot (like by many months or years). My guess is that the labs do believe this, although not for amazing reasons. Like, the answer which feels most "real" to me is that this complaint doesn't apply to RSPs because the labs aren't actually planning to do a meaningful pause.
Does the category “working/interning on AI alignment/control” include safety roles at labs? I’d be curious to see that statistic separately, i.e., the percentage of MATS scholars who went on to work in any role at labs.
Similarly in Baba is You: when people don't have a crisp understanding of the puzzle, they tend to grasp at straws and motivatedly-reason their way into accepting sketchy-sounding premises. But, the true solution to a level often feels very crisp and clear and inevitable.
A few of the scientists I’ve read about have realized their big ideas in moments of insight (e.g., Darwin for natural selection, Einstein for special relativity). My current guess about what’s going on is something like: as you attempt to understand a concept you don’t already have, you’re picking up clues about what the shape of the answer is going to look like (i.e., constraints). Once you have these constraints in place, your mind is searching for something which satisfies all of them (both explicitly and implicitly), and insight is the thing that happens when you find a solution that does.
At least, this is what it feels like for me when I play Baba is You (i.e., when I have the experience you’re describing here). I always know when a fake solution is fake, because it’s really easy to tell that it violates one of the explicit constraints the game has set out (although sometimes in desperation I try it anyway :p). But it’s immediately clear when I've landed on the right solution (even before I execute it), because all of the constraints I’ve been holding in my head get satisfied at once. I think that’s the “clicking” feeling.
Darwin’s insight about natural selection was also shaped by constraints. His time on the Beagle had led him to believe that “species gradually become modified,” but he was pretty puzzled as to how the changes were being introduced. If you imagine a beige lizard that lives in the sand, for instance, it seems pretty clear that it isn’t the lizard itself (its will) which causes its beigeness, nor is it the sand that directly causes the coloring (as in, physically causes it within the lizard’s lifetime). But then, how are changes introduced, if not by the organism, and not by the environment directly? He was stuck on this for a while, until: “I can remember the very spot in the road, whilst in my carriage, when to my joy the solution occurred to me.”
There’s more to Darwin’s story than that, but I do think it has elements of the sort of thing you're describing here. Jeff Hawkins also describes insight as a constraint satisfaction problem pretty explicitly (I might’ve gotten this idea from him), and he experienced it when coming up with the idea of a thousand brains.
Anyway, I don’t have a strong sense of how crucial this sort of thing is to novel conceptual inquiry in general, but I do think it’s quite interesting. It seems like one of the ways that someone can go from a pre-paradigmatic grasping around for clues sort of thing to a fully formed solution.
I'm somewhat confused about when these evaluations are performed (i.e., how much safety training the model has undergone at that point). OpenAI's paper says: "Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024," so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training or after? The latter seems more concerning to me, so I'm curious.
You actually want evaluators to have as much skin in the game as other employees so that when they take actions that might shut the company down or notably reduce the value of equity, this is a costly signal.
Further, it's good if evaluators are just considered normal employees and aren't separated out in any specific way. Then, other employees at the company will consider these evaluators to be part of their tribe and will feel some loyalty. (Also reducing the chance that risk evaluators feel like they are part of a rival "AI safety" tribe.) This probably has a variety of benefits in terms of support from the company. For example, when evaluators make a decision which is very costly for the company, it is more likely to be respected by other employees.
This situation seems backwards to me. Like, presumably the ideal scenario is that a risk evaluator estimates the risk in an objective way, and then the company takes (hopefully predefined) actions based on that estimate. The outcome of this interaction should not depend on social cues like how loyal they seem, or how personally costly it was for them to communicate that information. To the extent it does, I think this is evidence that the risk management framework is broken.
Thanks for writing this! I think it’s important for AI labs to write and share their strategic thoughts; I appreciate you doing so. I have many disagreements, but I think it’s great that the document is clear enough to disagree with.
You start the post by stating that “Our ability to do our safety work depends in large part on our access to frontier technology,” but you don’t say why. Like, there’s a sense in which much of this plan is predicated on Anthropic needing to stay at the frontier, but this document doesn’t explain why this is the right call to begin with. There are clearly some safety benefits to having access to frontier models, but the question is: are those benefits worth the cost? Given that this is (imo) by far the most important strategic consideration for Anthropic, I’m hoping for far more elaboration here. Why does Anthropic believe it’s important to work on advancing capabilities at all? Why is it worth the potentially world-ending costs?
This section also doesn’t explain why Anthropic needs to advance the frontier. For instance, it isn’t clear to me that anything from “Chapter 1” requires this—does remaining slightly behind the frontier prohibit Anthropic from e.g. developing automated red-teaming, or control techniques, or designing safety cases, etc.? Why? Indeed, as I understand it, Anthropic’s initial safety strategy was to remain behind other labs. Now Anthropic does push the frontier, but as far as I know no one has explained what safety concerns (if any) motivated this shift.
This is especially concerning because pushing the frontier seems very precarious, in the sense you describe here:
If [evaluations are] significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
... and here:
As with other aspects of the RSP described above, there are significant costs to both evaluations that trigger too early and evaluations that trigger too late.
But without a clear sense of why advancing the frontier is helpful for safety in the first place, it seems pretty easy to imagine missing this narrow target.
Like, here is a situation I feel worried about. We continue to get low-quality evidence about the danger of these systems (e.g. via red-teaming). This evidence is ambiguous and confusing—if a system can in fact do something scary (such as inserting a backdoor into another language model), what are we supposed to infer from that? Some employees might think it suggests danger, but others might think that it, e.g., wouldn’t be able to actually execute such plans, or that it’s just a one-off fluke from a model that’s still too incompetent to pose a real threat, etc. How is Anthropic going to think about this? The downside of being wrong is, as you’ve stated, extreme: a long enough pause could kill the company. And the evidence itself is almost inevitably going to be quite ambiguous, because we don’t understand what’s happening inside the model such that it’s producing these outputs.
But so long as we don’t understand enough about these systems to assess their alignment with confidence, I am worried that Anthropic will keep deciding to scale. Because when the evidence is as indirect and unclear as that which is currently possible to gather, interpreting it is basically just a matter of guesswork. And given the huge incentive to keep scaling, I feel skeptical that Anthropic will end up deciding to interpret anything but unequivocal evidence as suggesting enough danger to stop.
This is concerning because Anthropic seems to anticipate such ambiguity, as suggested by the RSP lacking any clear red lines. Ideally, if Anthropic finds that their model is capable of, e.g., self-replication, then this would cause some action like “pause until safety measures are ready.” But in fact what happens is this:
If sufficient measures are not yet implemented, pause training and analyze the level of risk presented by the model. In particular, conduct a thorough analysis to determine whether the evaluation was overly conservative, or whether the model indeed presents near-next-ASL risks.
In other words, one of the first steps Anthropic plans to take if a dangerous evaluation threshold triggers is to question whether that evaluation was actually meaningful in the first place. I think this sort of wiggle room, which is pervasive throughout the RSP, renders it pretty ineffectual—basically just a formal-sounding description of what they (and other labs) were already doing, which is attempting to crudely eyeball the risk.
And given that the RSP doesn’t bind Anthropic to much of anything, so much of the decision making hinges on the quality of its company culture. For instance, here is Nick Joseph’s description:
Fortunately, I think my colleagues, both on the RSP and elsewhere, are both talented and really bought into this, and I think we’ll do a great job on it. But I do think the criticism is valid, and that there is a lot that is left up for interpretation here, and it does rely a lot on people having a good-faith interpretation of how to execute on the RSP internally.
[...]
But I do agree that ultimately you need to have a culture around thinking these things are important and having everyone bought in. As I said, some of these things are like, did you solicit capabilities well enough? That really comes down to a researcher working on this actually trying their best at it. And that is quite core, and I think that will just continue to be.
Which is to say that Anthropic’s RSP doesn’t appear to me to pass the LeCun test. Not only is the interpretation of the evidence left up to Anthropic’s discretion (including retroactively deciding whether a test actually was a red line), but the quality of the safety tests themselves is also a function of company culture (i.e., of whether researchers are “actually trying their best” to “solicit capabilities well enough”).
I think the LeCun test is a good metric, and I think it’s good to aim for. But when the current RSP is so far from passing it, I’m left wanting to hear more discussion of how you’re expecting it to get there. What do you expect will change in the near future, such that balancing these delicate tradeoffs—too lax vs. too strict, too vague vs. too detailed, etc.—doesn’t result in another scaling policy which also doesn’t constrain Anthropic’s ability to scale roughly at all? What kinds of evidence are you expecting you might encounter, that would actually count as a red line? Once models become quite competent, what sort of evidence will convince you that the model is safe? Aligned? And so on.
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.
And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”
Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve.
I think it’s really alarming that this safety framework contains no commitments, and I’m frustrated that concern about this is brushed aside. If DeepMind is aiming to build AGI soon, I think it’s reasonable to expect they have a mitigation plan more concrete than a list of vague IOU’s. For example, consider this description of a plan from the FSF:
When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results.
This isn’t a baseline of reasonable practices upon which better safety practices can evolve, so much as a baseline of zero—the fact that DeepMind will take some unspecified action, rather than none, if an evaluation triggers does not appear to me to be much of a safety strategy at all.
Which is especially concerning, given that DeepMind might create AGI within the next few years. If it’s too early to make commitments now, when do you expect that will change? What needs to happen, before it becomes reasonable to expect labs to make safety commitments? What sort of knowledge are you hoping to obtain, such that scaling becomes a demonstrably safe activity? Because without any sense of when these “best practices” might emerge, or how, the situation looks alarmingly much like the labs requesting a blank check for as long as they deem fit, and that seems pretty unacceptable to me.
AI is a nascent field. But to my mind its nascency is what’s so concerning: we don’t know how to ensure these systems are safe, and the results could be catastrophic. This seems much more reason to not build AGI than it does to not offer any safety commitments. But if the labs are going to do the former, then I think they have a responsibility to provide more than a document of empty promises—they should be able to state, exactly, what would cause them to stop building these potentially lethal machines. So far no lab has succeeded at this, and I think that’s pretty unacceptable. If labs would like to take up the mantle of governing their own safety, then they owe humanity the service of meaningfully doing it.
I'm sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don't know what the decision process inside of Anthropic will look like if an evaluation indicates something like "yeah, it's excellent at inserting backdoors, and also, the vibe is that it's overall pretty capable." And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it'll make these decisions (imo).
I will also note what I feel is a somewhat concerning trend. It's happened many times now that I've critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: "this wouldn't seem so bad if you knew what was happening behind the scenes." They of course cannot tell me what the "behind the scenes" information is, so I have no way of knowing whether that's true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of "we're doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are" is pretty sketchy.
Largely agree with everything here.
But, I've heard some people be concerned "aren't basically all SSP-like plans basically fake? is this going to cement some random bureaucratic bullshit rather than actual good plans?." And yeah, that does seem plausible.
I do think that all SSP-like plans are basically fake, and I’m opposed to them becoming the bedrock of AI regulation. But I worry that people take the premise “the government will inevitably botch this” and conclude something like “so it’s best to let the labs figure out what to do before cementing anything.” This seems alarming to me. Afaict, the current world we’re in is basically the worst case scenario—labs are racing to build AGI, and their safety approach is ~“don’t worry, we’ll figure it out as we go.” But this process doesn’t seem very likely to result in good safety plans either; charging ahead as is doesn’t necessarily beget better policies. So while I certainly agree that SSP-shaped things are woefully inadequate, it seems important, when discussing this, to keep in mind what the counterfactual is. Because the status quo is not, imo, a remotely acceptable alternative either.
Or independent thinkers try to find new frames because the ones on offer are insufficient? I think this is roughly what people mean when they say that AI is "pre-paradigmatic," i.e., we don't have the frames for filling to be very productive yet. Given that, I'm more sympathetic to framing posts on the margin than I am to filling ones, although I hope (and expect) that filling-type work will become more useful as we gain a better understanding of AI.
I think this letter is quite bad. If Anthropic were building frontier models for safety purposes, then they should be welcoming regulation. Because building AGI right now is reckless; it is only deemed responsible in light of its inevitability. Dario recently said “I think if [the effects of scaling] did stop, in some ways that would be good for the world. It would restrain everyone at the same time. But it’s not something we get to choose… It’s a fact of nature… We just get to find out which world we live in, and then deal with it as best we can.” But it seems to me that lobbying against regulation like this is not, in fact, inevitable. To the contrary, it seems like Anthropic is actively using their political capital—capital they had vaguely promised to spend on safety outcomes, tbd—to make the AI arms race counterfactually worse.
The main changes that Anthropic has proposed—to prevent the formation of new government agencies which could regulate them, to not be held accountable for unrealized harm—are essentially bids to continue voluntary governance. Anthropic doesn’t want a government body to “define and enforce compliance standards,” or to require “reasonable assurance” that their systems won’t cause a catastrophe. Rather, Anthropic would like for AI labs to only be held accountable if a catastrophe in fact occurs, and only so much at that, as they are also lobbying to have their liability depend on the quality of their self-governance: “but if a catastrophe happens in a way that is connected to a defect in a company’s SSP, then that company is more likely to be liable for it.” Which is to say that Anthropic is attempting to inhibit the government from imposing testing standards (what Anthropic calls “pre-harm”), and in general aims to inhibit regulation of AI before it causes mass casualty.
I think this is pretty bad. For one, voluntary self-governance is obviously problematic. All of the labs, Anthropic included, have significant incentive to continue scaling; indeed, they say as much in this document: “Many stakeholders reasonably worry that this [agency]... might end up… impeding innovation in general.” And their attempts to self-govern are so far, imo, exceedingly weak—their RSP commits to practically nothing if an evaluation threshold triggers, leaving all of the crucial questions, such as “what will we do if our models show catastrophic inclinations,” up to Anthropic’s discretion. This is clearly unacceptable—both the RSP in itself, and Anthropic’s bid for it to continue to serve as the foundation of regulation. Indeed, if Anthropic would like for other companies to be safer, which I believed to be one of their main safety selling points, then they should be welcoming the government stepping in to ensure that.
Afaict their rationale for opposing this regulation is that the labs are better equipped to design safety standards than the government is: “AI safety is a nascent field where best practices are the subject of original scientific research… What is needed in such a new environment is iteration and experimentation, not prescriptive enforcement. There is a substantial risk that the bill and state agencies will simply be wrong about what is actually effective in preventing catastrophic risk, leading to ineffective and/or burdensome compliance requirements.” But there is also, imo, a large chance that Anthropic is wrong about what is actually effective at preventing catastrophic risk, especially so, given that they have incentive to play down such risks. Indeed, their RSP strikes me as being incredibly insufficient at assuring safety, as it is primarily a reflection of our ignorance, rather than one built from a scientific understanding, or really any understanding, of what it is we’re creating.
I am personally very skeptical that Anthropic is capable of turning our ignorance into the sort of knowledge capable of providing strong safety guarantees anytime soon, and soon is the timeframe by which Dario aims to build AGI. Such that, yes, I expect governments to do a poor job of setting industry standards, but only because I expect that a good job is not possible given our current state of understanding. And I would personally rather, in this situation where labs are racing to build what is perhaps the most powerful technology ever created, to err on the side of the government guessing about what to do, and beginning to establish some enforcement about that, than to leave it for the labs themselves to decide.
Especially so, because if one believes, as Dario seems to, that AI has a significant chance of causing massive harm, that it could “destroy us,” and that this might occur suddenly (“indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot”), then one shouldn’t be opposing regulation which could, in principle, stop this from happening. We don’t necessarily get warning shots with AI; indeed, this is one of the main problems with building it “iteratively,” one of the main problems with Anthropic’s “empirical” approach to AI safety. Because what Anthropic means by “a pessimistic scenario” is that “it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves.” Simply an empirical fact. And in what worlds do we learn this empirical fact without catastrophic outcomes?
I have to believe that Anthropic isn’t hoping to gain such evidence by way of catastrophes in fact occurring. But if they would like for such pre-harm evidence to have a meaningful impact, then it seems like having pre-harm regulation in place would be quite helpful. Because one of Anthropic’s core safety strategies rests on their ability to “sound the alarm”; indeed, this seems to account for something like ~33% of their safety profile, given that they believe “pessimistic scenarios” are around as likely as good, or only-kind-of-bad, scenarios. And in “pessimistic” worlds, where alignment is essentially unsolvable, and catastrophes are impending, their main fallback is to alert the world of this unfortunate fact so that we can “channel collective effort” towards some currently unspecified actions. But the sorts of actions that the world can take, at this point, will be quite limited unless we begin to prepare for them ahead of time.
Like, the United States government usually isn’t keen on shutting down or otherwise restricting companies on the basis of unrealized harm. And even if they were keen, I’m not sure how they would do this—legislation likely won’t work fast enough, and even if the President could sign an executive order to e.g. limit OpenAI from releasing or further creating their products, this would presumably be a hugely unpopular move without very strong evidence to back it up. And it’s pretty difficult for me to see what kind of evidence this would have to be, to take a move this drastic and this quickly. Anything short of the public witnessing clearly terrible effects, such as mass casualty, doesn’t seem likely to pass muster in the face of a political move this extreme.
But in a world where Anthropic is sounding alarms, they are presumably doing so before such catastrophes have occurred. Which is to say that without structures in place to put significant pressure on or outright stop AI companies on the basis of unrealized harm, Anthropic’s alarm sounding may not amount to very much. Such that pushing against regulation which is beginning to establish pre-harm standards makes Anthropic’s case for “sounding the alarm”—a large fraction of their safety profile—far weaker, imo. But I also can’t help feeling that these are not real plans; not in the beliefs-pay-rent kind of way, at least. It doesn’t seem to me that Anthropic has really gamed out what such a situation would look like in sufficient detail for it to be a remotely acceptable fallback in the cases where, oops, AI models begin to pose imminent catastrophic risk. I find this pretty unacceptable, and I think Anthropic’s opposition to this bill is yet another case where they are at best relegating safety to second fiddle, and at worst not prioritizing it meaningfully at all.
I agree that scoring "medium" seems like it would imply crossing into the medium zone, although I think what they actually mean is "at most medium." The full quote (from above) says:
In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.
I.e., I think what they’re trying to say is that they have different categories of evals, each of which might pass different thresholds of risk. If any of those are “high,” then they're in the "medium zone" and they can’t deploy. But if they’re all medium, then they're in the "below medium zone" and they can. This is my current interpretation, although I agree it’s fairly confusing and it seems like they could (and should) be more clear about it.
Maybe I'm missing the relevant bits, but afaict their preparedness doc says that they won't deploy a model if it passes the "medium" threshold, e.g.:
Only models with a post-mitigation score of "medium" or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.
The threshold for further developing is set to "high," though. I.e., they can further develop so long as models don't hit the "critical" threshold.
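To check that I'm parsing this consistently, here's a minimal sketch of my reading of the combined rule (my own illustration in Python, not anything from OpenAI's actual framework; the category names are just placeholders): deployment requires every post-mitigation score to be at most "medium," while further development only requires every pre-mitigation score to stay below "critical."

```python
# My interpretation only (not OpenAI's implementation): each tracked category gets a
# risk level; deployment needs all post-mitigation scores <= "medium"; further
# development needs all pre-mitigation scores below "critical". Category names are placeholders.

RISK_LEVELS = ["low", "medium", "high", "critical"]  # ordered least to most severe

def risk_rank(level: str) -> int:
    return RISK_LEVELS.index(level)

def can_deploy(post_mitigation: dict[str, str]) -> bool:
    # Deployment allowed only if no category exceeds "medium" post-mitigation.
    return all(risk_rank(lvl) <= risk_rank("medium") for lvl in post_mitigation.values())

def can_continue_development(pre_mitigation: dict[str, str]) -> bool:
    # Further development allowed only if no category reaches "critical" pre-mitigation.
    return all(risk_rank(lvl) < risk_rank("critical") for lvl in pre_mitigation.values())

# One "high" category is enough to block deployment, even if the others are lower:
print(can_deploy({"cbrn": "medium", "cyber": "high", "persuasion": "low"}))    # False
print(can_deploy({"cbrn": "medium", "cyber": "medium", "persuasion": "low"}))  # True
```

On this reading, "if any category is high, they're in the medium zone and can't deploy; if all are medium or below, they can" falls out directly, which is why I think "medium" in the quote means "at most medium."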
But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".
But doesn't this argument hold with the opposite conclusion, too? E.g. "abstract reasoning about unfamiliar domains is hard therefore you should distrust arguments about good-by-default worlds".
But whether an organization can easily respond is pretty orthogonal to whether they’ve done something wrong. Like, if 80k is indeed doing something that merits a boycott, then saying so seems appropriate. There might be some debate about whether this is warranted given the facts, or even whether the facts are right, but it seems misguided to me to make the strength of an objection proportional to someone’s capacity to respond rather than to the badness of the thing they did.
This comment appears to respond to habryka, but doesn’t actually address what I took to be his two main points—that Anthropic was using NDAs to cover non-disparagement agreements, and that they were applying significant financial incentive to pressure employees into signing them.
We historically included standard non-disparagement agreements by default in severance agreements
Were these agreements subject to NDA? And were all departing employees asked to sign them, or just some? If the latter, what determined who was asked to sign?
Agreed. I'd be especially interested to hear this from people who have left Anthropic.
I do kind of share the sense that people mostly just want frisbee and tea, but I am still confused about it. Wasn't religion a huge deal for people for most of history? I could see a world where they were mostly just going through the motions, but the amount of awe I feel going into European churches feels like some evidence against this. And it's hard for me to imagine that people were kind of just mindlessly sitting there through e.g., Gregorian chanting in candlelight, but maybe I am typical minding too hard. It really seems like these rituals, the architecture, all of it, was built to instill the sort of existential intensity that taking God seriously requires, and I have to imagine that this was at least somewhat real for most people?
And I do wonder whether the shift towards frisbee and tea has more to do with a lack of options as compelling as cathedrals (on this axis at least) than with people's lack of wanting it? Like, I don't think I would get as much out of cathedrals as I expect some of these people did, because I'm not religious, but if something of that intensity existed which fit with my belief system I feel like I'd be so into it.
Right now it seems like the entire community is jumping to conclusions based on a couple of "impressions" people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that's on you.
Like Robert, the impressions I had were based on what I heard from people working at Anthropic. I cited various bits of evidence because those were the ones available, not because they were the most representative. The most representative were those from Anthropic employees who concurred that this was indeed the implication, but it seemed bad form to cite particular employees (especially when that information was not public by default) rather than, e.g., Dario. I think Dustin’s statement was strong evidence of this impression, though, and I still believe Anthropic to have at least insinuated it.
I agree with you that most people are not aiming for as much stringency with their commitments as rationalists expect. Separately, I do think that what Anthropic did would constitute a betrayal, even in everyday culture. And in any case, I think that when you are making a technology which might extinct humanity, the bar should be significantly higher than “normal discourse.” When you are doing something with that much potential for harm, you owe it to society to make firm commitments that you stick to. Otherwise, as kave noted, how are we supposed to trust your other “commitments”? Your RSP? If all you can offer are vague “we’ll figure it out when we get there,” then any ambiguous statement should be interpreted as a vibe, rather than a real plan. And in the absence of unambiguous statements, as all the labs have failed to provide, this is looking very much like “trust us, we’ll do the right thing.” Which, to my mind, is nowhere close to the assurances society ought to be provided given the stakes.
I do think it would be nice if Anthropic did make such a statement, but seeing how adversarially everyone has treated the information they do release, I don't blame them for not doing so.
This reasoning seems to imply that Anthropic should only be obliged to convey information when the environment is sufficiently welcoming to them. But Anthropic is creating a technology which might extinct humanity—they have an obligation to share their reasoning regardless of what society thinks. In fact, if people are upset by their actions, there is more reason, not less, to set the record straight. Public scrutiny of companies, especially when their choices affect everyone, is a sign of healthy discourse.
The implicit bid for people not to discourage them—because that would make it less likely for a company to be forthright—seems incredibly backwards, because then the public is unable to mention when they feel Anthropic has made a mistake. And if Anthropic is attempting to serve the public, which they at least pay lip service to through their corporate structure, then they should be grateful for this feedback, and attempt to incorporate it.
So I do blame them for not making such a statement—it is on them to show to humanity, the people they are making decisions for, why those decisions are justified. It is not on society to make the political situation sufficiently palatable such that they don’t face any consequences for the mistakes they have made. It is on them not to make those mistakes, and to own up to them when they do.
I have not heard from anyone who wasn’t released, and I think it is reasonably likely I would have heard from them anonymously on Signal. Also, not releasing a bunch of people after saying they would seems like an enormously unpopular, hard to keep secret, and not very advantageous move for OpenAI, which is already taking a lot of flak for this.
I’m not necessarily imagining that OpenAI failed to release a bunch of people, although that still seems possible to me. I’m more concerned that they haven’t released many key people, and while I agree that you might have received an anonymous Signal message to that effect if it were true, I still feel alarmed that many of these people haven’t publicly stated otherwise.
I also have a model of how people choose whether or not to make public statements where it’s extremely unsurprising most people would not choose to do so.
I do find this surprising. Many people are aware of who former OpenAI employees are, and hence are aware of who was (or is) bound by this agreement. At the very least, if I were in this position, I would want people to know that I was no longer bound. And it does seem strange to me, if the contract has been widely retracted, that so few prominent people have confirmed being released.
It also seems pretty important to figure out who is under mutual non-disparagement agreements with OpenAI, which would still (imo) pose a problem if it applied to anyone in safety evaluations or policy positions.
I imagine many of the people going into leadership positions were prepared to ignore the contract, or maybe even forgot about the nondisparagement clause
I could imagine it being the case that people are prepared to ignore the contract. But unless they publicly state that, it wouldn’t ameliorate my concerns—otherwise how is anyone supposed to trust they will?
The clause is also open to more avenues of legal attack if it's enforced against someone who takes another position which requires disparagement (e.g. if it's argued to be a restriction on engaging in business).
That seems plausible, but even if this does increase the likelihood that they’d win a legal battle, legal battles still pose huge risk and cost. This still seems like a meaningful deterrent.
I don't think it's fair to view this as a serious breach of trust on behalf of any individual, without clear evidence that it impacted their decisions or communication.
But how could we even get this evidence? If they’re bound to the agreement their actions just look like an absence of saying disparaging things about OpenAI, or of otherwise damaging their finances or reputation. And it’s hard to tell, from the outside, whether this is a reflection of an obligation, or of a genuine stance. Positions of public responsibility require public trust, and the public doesn’t have access to the inner workings of these people’s minds. So I think it’s reasonable, upon finding out that someone has a huge and previously-undisclosed conflict of interest, to assume that might be influencing their behavior.
I agree this is usually the case, but I think it’s not always true, and I don’t think it’s necessarily true here. E.g., people as early as Da Vinci guessed that we’d be able to fly long before we had planes (or even any flying apparatus which worked). Because birds can fly, and so we should be able to as well (at least, this was Da Vinci and the Wright brothers' reasoning). That end point was not dependent on details (early flying designs had wings like a bird, a design which we did not keep :p), but was closer to a laws of physics claim (if birds can do it there isn’t anything fundamentally holding us back from doing it either).
Superintelligence holds a similar place in my mind: intelligence is physically possible, because we exhibit it, and it seems quite arbitrary to assume that we’ve maxed it out. But also, intelligence is obviously powerful, and reality is obviously more manipulable than we currently have the means to manipulate it. E.g., we know that we should be capable of developing advanced nanotech, since cells can, and that space travel/terraforming/etc. is possible.
These two things together—“we can likely create something much smarter than ourselves” and “reality can be radically transformed”—is enough to make me feel nervous. At some point I expect most of the universe to be transformed by agents; whether this is us, or aligned AIs, or misaligned AIs or what, I don’t know. But looking ahead and noticing that I don’t know how to select the “aligned AI” option from the set “things which will likely be able to radically transform matter” seems enough cause, in my mind, for exercising caution.
Bloomberg confirms that OpenAI has promised not to cancel vested equity under any circumstances, and to release all employees from one-directional non-disparagement agreements.
They don't actually say "all" and I haven't seen anyone confirm that all employees received this email. It seems possible (and perhaps likely) to me that many high-profile safety people did not receive it, especially since it would presumably be in Sam's interest to withhold it from them, and since I haven't seen them claiming otherwise. And we wouldn't know: those who are still under the contract can't say anything. If OpenAI only sent an email to some former employees then they can come away with headlines like "OpenAI releases former staffers from agreement" which is true, without giving away their whole hand. Perhaps I'm being too pessimistic, but I am under the impression that we're dealing with a quite adversarial player, and until I see hard evidence otherwise this is what I'm assuming.
Why do you think this? The power that I'm primarily concerned about is the power to pause, and I'm quite skeptical that companies like Amazon and Google would be willing to invest billions of dollars in a company which may decide to do something that renders their investment worthless. I.e., I think a serious pause, one on the order of months or years, is essentially equivalent to opting out of the race to AGI. On this question, my strong prior is that investors like Google and Amazon have more power than employees or the trust, else they wouldn't invest.
"So God can’t make the atoms be arranged one way and the humans be arranged another contradictory way."
But couldn't he have made a different sort of thing than humans, which were less prone to evil? Like, it seems to me that he didn't need to make us evolve through the process of natural selection, such that species were always in competition, status was a big deal, fighting over mates commonplace, etc. I do expect that there's quite a bit of convergence in the space of possible minds—even if one is selecting them from the set of "all possible atomic configurations of minds"—but I would still guess that not all of those are as prone to "evil" as us. I.e., if the laws of physics were held constant, I would think you could get less evil things than us out of it, and probably worlds which were overall more favorable to life (fewer natural disasters, etc.). But perhaps this is even more evidence that God only cares about the laws of physics? Since we seem much more like an afterthought than a priority?
Secondly, following Dennett, the point of modeling cognitive systems according to the intentional stance is that we evaluate them on a behavioral basis and that is all there is to evaluate.
I am confused on this point. Several people have stated that Dennett believes something like this, e.g., Quintin and Nora argue that Dennett is a goal "reductionist," by which I think they mean something like "goal is the word we use to refer to certain patterns of behavior, but it's not more fundamental than that."
But I don't think Dennett believes this. He's pretty critical of behaviorism, for instance, and his essay Skinner Skinned does a good job, imo, of showing why this orientation is misguided. Dennett believes, I think, that things like "goals," "beliefs," "desires," etc. do exist, just that we haven't found the mechanistic or scientific explanation of them yet. But he doesn't think that explanations of intention will necessarily bottom out in just their outward behavior, he expects such explanations to make reference to internal states as well. Dennett is a materialist, so of course at the end of the day all explanations will be in terms of behavior (inward or outward), on some level, much like any physical explanation is. But that's a pretty different claim from "mental states do not exist."
I'm also not sure if you're making that claim here or not, but curious if you disagree with the above?
I don't know what Katja thinks, but for me at least: I think AI might pose much more lock-in than other technologies. I.e., I expect that we'll have much less of a chance (and perhaps much less time) to redirect course, adapt, learn from trial and error, etc. than we typically do with a new technology. Given this, I think going slower and aiming to get it right on the first try is much more important than it normally is.
I agree there are other problems the EA biosecurity community focuses on, but surely lab escapes are one of those problems, and part of the reason we need biosecurity measures? In any case, this disagreement seems beside the main point that I took Adam to be making, namely that the track record for defining appropriate units of risk for poorly understood, high-attack-surface domains is quite bad (as with BSL). This still seems true to me.
Dennett meant a lot to me, in part because he’s shaped my thinking so much, and in part because I think we share a kindred spirit—this ardent curiosity about minds and how they might come to exist in a world like ours. I also think he is an unusually skilled thinker and writer in many respects, as well as being an exceptionally delightful human. I miss him.
In particular, I found his deep and persistent curiosity beautiful and inspiring, especially since it’s aimed at all the (imo) important questions. He has a clarity of thought which manages to be both soft and precise, and a robust ability to detect and avoid bullshit. His book Intuition Pumps and Other Tools for Thinking is full of helpful cognitive strategies, many of which I’ve benefited from, and many of which have parallels in the Sequences. You can just tell that he’s someone in love with minds and the art of thinking, and that he’s actually trying at it.
But perhaps the thing I find most inspiring about him, the bit which I most want to emulate, is that he doesn’t shy away from the difficult questions—consciousness, intentionality, what are real patterns, how can we tell if a system understands something, etc.—but he does so without any lapse in intellectual rigor. He’s always aiming at operationalization and gears-level understanding, but he’s careful to check whether mechanistic models in fact correspond to the higher-level phenomena he’s attempting to address. He doesn’t let things be explained away, but he also doesn’t let things remain mysterious. He’s deeply committed to a materialistic understanding of the world which permits of minds.
In short, he holds the same mysteries that I do, I think, of how thinking things could come to exist in a world made out of atoms, and he’s committed, as I am, to naturalizing such mysteries in a satisfying way.
He’s also very clear about the role of philosophy in science: it’s the process of figuring out what the right questions even are, such that one can apply the tools of science to answer them. I think he’s right, both that this is the role of good philosophy and that we’re all pretty confused about what the right questions of mind are. I think he did an excellent job of narrowing the confusion, which is a really fucking cool and admirable thing to spend a life on. But the work isn’t done. In many ways, I view my research as picking up where he left off—the quest for a satisfying account of minds in a materialistic, deterministic world. Now that he’s passed, I realize that I really wanted him to see that. I wanted to show him my work. I feel like part of the way I was connected to the world has been severed, and I am feeling grief about that.
I’ve learned so much from Dennett. How to think better, how to hold my curiosity better, how to love the mind, and how to wonder productively about it. I feel like the world glows dimmer now than it did before, and I feel that grief—the blinking out of this beautiful light. But it is also a good time to reflect on all that he’s done for the world, and all that he’s done for me. He is really a part of me, and I feel the love and the gratitude for what he’s brought into my life.
Aw man, this is so exciting! There’s something really important to me about rationalist virtues having a home in the world. I’m not sure if what I’m imagining is what you’re proposing, exactly, but I think most anything in this vicinity would feel like a huge world upgrade to me.
Apparently I have a lot of thoughts about this. Here are some of them, not sure how applicable they are to this project in particular. I think you can consider this to be my hopes for what such a thing might be like, which I suspect shares some overlap.
It has felt to me for a few years now like something important is dying. I think it stems from the seeming inevitability of what’s before us—the speed of AI progress, our own death, the death of perhaps everything—that looms, shadow-like. And it’s scary to me, and sad, because “inevitability” is a close cousin of “defeat,” and I fear the two inch closer all the time.
It’s a fatalism that creeps in slow, but settles thick. And it lurks, I think, in the emotional tenor of doom that resides beneath nominally probabilistic estimates of our survival. Lurks as well, although much more plainly, within AI labs: AGI is coming whether we want it to or not, pausing is impossible, the invisible hand holds the reins, or as Claude recently explained to me, “the cat is already out of the bag.” And I think this is sometimes intentional—we are supposed to think about labs in terms of the overwhelming incentives, more than we are supposed to think about them as composed of agents with real choice, because that dispossesses them of responsibility, and dispossesses us of the ability to change them.
There is a similar kind of fatalism that often attaches to the idea of the efficient marketplace—that what is desired has already been done, that if one sits back and lets the machine unfold it will arrive at all the correct conclusions itself. There is no room, in that story, for genuinely novel ideas or progress, all forward movement is the result of incremental accretions on existing structures. This sentiment looms in academia as well—that there is nothing fundamental or new left to uncover, that all low hanging fruit has been plucked. Academic aims rarely push for all that could be—progress is instead judged relatively, the slow inching away from what already is.
And I worry this mentality is increasingly entrenching itself within AI safety, too. That we are moving away from the sort of ambitious science that I think we need to achieve the world that glows—the sort that aims at absolute progress—and instead moving closer to an incremental machine. After all, MIRI tried and failed to develop agent foundations so maybe we can say, “case closed?” Maybe “solving alignment” was never the right frame in the first place. Maybe it always was that we needed to do the slow inching away from the known, the work that just so happens not to challenge existing social structures. There seems to me, in other words, to be a consensus closing in: new theoretical insights are unlikely to emerge, let alone to have any real impact on engineering. And unlikelier, still, to happen in time.
I find all of this fatalism terribly confused. Not only because it has, I think, caused people to increasingly depart from the theoretical work which I believe is necessary to reach the world that glows, but because it robs us of our agency. The closer one inches towards inevitability, the further one inches away from the human spirit having any causal effect in the world. What we believe is irrelevant, what is good and right is irrelevant; the grooves have been worn, the structures erected—all that’s left is for the world to follow course. We cannot simply ask people to do what’s right, because they apparently can’t. We cannot succeed at stopping what is wrong, because the incentives are too strong to be opposed. All we can do, it seems, is to meld with the structure itself, making minor adjustments on the margin.
And there’s a feeling I get, sometimes, when I look at all of this, as if a tidal wave were about to engulf me. The machine has a life of its own; the world is moved by forces outside of my control. And it scares me, and I feel small. But then I remember that it’s wrong.
There was a real death, I think, that happened when MIRI leadership gave up on solving alignment, but we haven’t yet held the funeral. I think people carry that—the shadow of the fear, unnamed but tangible: that we might be racing towards our inevitable death, that there might not be much hope, that the grooves have been worn, the structures erected, and all that’s left is to give ourselves away as we watch it all unravel. It’s not a particularly inspiring vision, and in my opinion, not a particularly correct one. The future is built out of our choices; they matter, they are real. Not because it would be nice to believe it, but because it is macroscopically true. If one glances at history, it’s obvious that ideas are powerful, that people are powerful. The incentives do not dictate everything, the status quo is never the status quo for very long. The future is still ours to decide. And it’s our responsibility to do so with integrity.
I have a sense that this spirit has been slipping, with MIRI leadership largely admitting defeat, with CFAR mostly leaving the scene, with AI labs looming increasingly large within the culture and the discourse. I don’t want it to. I want someone to hold the torch of rationality and all its virtues, to stay anchored on what is true and good amidst a landscape of rapidly changing power dynamics, to fight for what’s right with integrity, to hold a positive vision for humanity. I want a space for deep inquiry and intellectual rigor, for aiming at absolute progress, for trying to solve the god damn problem. I think Lightcone has a good shot at doing a fantastic job of bringing something like this to life, and I’m very excited to see what comes of this!
Huh, I feel confused. I suppose we just have different impressions. Like, I would say that Oliver is exceedingly good at cutting through the bullshit. E.g., I consider his reasoning around shutting down the Lightcone offices to be of this type, in that it felt like a very straightforward document of important considerations, some of which I imagine were socially and/or politically costly to state. One way to say that is that I think Oliver is very high integrity, and I think this helps with bullshit detection: it's easier to see when things don't cut to the core if you deeply care about the core yourself. In any case, I think this skill carries over to object-level research, e.g., he often seems, to me, to ask cutting-to-the-core type questions there, too. I also think he's great at argument: legible reasoning, identifying the important cruxes in conversations, etc., all of which makes it easier to tell the bullshit from the not.
I do not think of Oliver as being afraid to be disagreeable, and ime he gets to the heart of things quite quickly, so much so that I found him quite startling to interact with when we first met. And although I have some disagreements over Oliver's past walled-garden taste, from my perspective it's getting better, and I am increasingly excited about him being at the helm of a project such as this. Not sure what to say about his beacon-ness, but I do think that many people respect Oliver, Lightcone, and rationality culture more generally; I wouldn't be that surprised if there were an initial group of independent researcher types who were down and excited for this project as is.
This is very cool! I’m excited to see where it goes :)
A couple questions (mostly me grappling with what the implications of this work might be):
- Given a dataset of sequences of tokens, how do you find the HMM that could have generated it, and can this be done automatically? Also, is the mapping from dataset to HMM unique? (A rough sketch of one standard way to do this is included after these questions.)
- This question is possibly more confused on my end, sorry if so. I’m trying to get at something like “how interpretable will these simplexes be with much larger models?” Like, if I’m imagining that each state is a single token, and the HMM is capable of generating the totality of data the model sees, then I’m imagining something quite unwieldy, i.e., something with about the same level of complexity and interpretability as, e.g., the signaling cascade networks in a cell. Is this imagination wrong? Or is it more like, you start with this unwieldy structure (but which has some nice properties nonetheless), and then from there you try to make the initial structure more parse-able? Maybe a more straightforward way to ask: you say you’re interested in formalizing things like situational awareness with these tools—how might that work?
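For the first question, the standard answer is Baum-Welch (expectation-maximization) fitting of an HMM with a fixed number of hidden states, which can be run automatically once you've committed to that state count. Below is a minimal sketch of what that might look like, assuming the hmmlearn library and a made-up toy vocabulary and dataset (none of which come from the post):

```python
# A minimal sketch (my illustration, not anything from the post): fit an HMM
# to token sequences with Baum-Welch (EM), using the hmmlearn library.
import numpy as np
from hmmlearn import hmm

# Toy "dataset": two token sequences over a 3-symbol vocabulary {0, 1, 2}.
sequences = [[0, 1, 2, 1, 0, 2], [2, 2, 1, 0, 1]]
X = np.concatenate(sequences).reshape(-1, 1)  # hmmlearn expects one column of symbols
lengths = [len(s) for s in sequences]         # plus per-sequence lengths

# The number of hidden states is chosen up front (a modeling decision,
# not something recovered from the data); the parameters are then fit by EM.
model = hmm.CategoricalHMM(n_components=4, n_iter=200, random_state=0)
model.fit(X, lengths)

print(model.transmat_)          # learned transitions between hidden states
print(model.emissionprob_)      # learned per-state distributions over tokens
print(model.score(X, lengths))  # log-likelihood of the data under the fit
```

This sketch is also why I'd guess the answer to the uniqueness question is "no" in general: EM only finds a local optimum, hidden states can be permuted freely, and different choices of state count fit the same data to different degrees.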
Something feels very off to me about these kinds of speciesist arguments. Like the circle of moral concern hasn’t expanded, but imploded, rooting out the very center from which it grew. Yes, there is a sense in which valuing what I value is arbitrary and selfish, but concluding that I should completely forego what I value seems pretty alarming to me, and, I would assume, to most other humans who currently exist.
I guess I'm not sure what you mean by "most scientific progress," and I'm missing some of the history here, but my sense is that importance-weighted science happens proportionally more outside of academia. E.g., Einstein did his miracle year outside of academia (and later stated that he wouldn't have been able to do it, had he succeeded at getting an academic position), Darwin figured out natural selection, and Carnot figured out the Carnot cycle, all mostly on their own, outside of academia. Those are three major scientists who arguably started entire fields (quantum mechanics, evolutionary biology, and thermodynamics). I would anti-predict that future scientific progress, of the field-founding sort, comes primarily from people at prestigious universities, since they, imo, typically have some of the most intense gatekeeping dynamics which make it harder to have original thoughts.
I don’t see how the cluster argument resolves the circularity problem.
The circularity problem, as I see it, is that your definition of an abstraction shouldn’t be dependent on already having the abstraction. I.e., if the only way to define the abstraction “dog” involves you already knowing the abstraction “dog” well enough to create the set of all dogs, then probably you’re missing some of the explanation for abstraction. But the clusters in thingspace argument also depends on having an abstraction—knowing to look for genomes, or fur, or bark, is dependent on us already understanding what dogs are like. After all, there are nearly infinite “axes” one could look at, but we already know to only consider some of them. In other words, it seems like this has just passed the buck from choice of object to choice of properties, but you’re still making that choice based on the abstraction.
The fact that choice of axis—from among the axes we already know to be relevant—is stable (i.e., creates the same clusterings) feels like a central and interesting point about abstractions. But it doesn’t seem like it resolves the circularity problem.
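To make the "passed the buck" point concrete, here's a toy sketch (my own illustration, with made-up features and sklearn as an assumed tool, not anything from the post): run the same clustering procedure over the same objects twice, once on axes we already know track the kind of thing in question and once on axes that don't, and you get different clusterings out.

```python
# Toy illustration: which clusters you recover depends on which "axes" you measure.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 200

# Two underlying "kinds" of object, differing along two axes we already
# know to care about ("relevant"), plus two axes that vary independently
# of kind ("irrelevant").
kind = rng.integers(0, 2, size=n)
relevant = kind[:, None] * 3.0 + rng.normal(size=(n, 2))
irrelevant = rng.normal(size=(n, 2)) * 3.0

labels_rel = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(relevant)
labels_irr = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(irrelevant)

def agreement(labels, truth):
    """Fraction of points whose cluster matches the kind, up to label swap."""
    acc = (labels == truth).mean()
    return max(acc, 1 - acc)

print(agreement(labels_rel, kind))  # ~1.0: clusters recover the kind
print(agreement(labels_irr, kind))  # ~0.5: clusters track something else entirely
```

The clustering machinery is identical in both cases; all of the work of recovering "the kind" was done by the prior choice of axes.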
(In retrospect the rest of this comment is thinking-out-loud for myself, mostly :p but you might find it interesting nonetheless).
I think it’s hard to completely escape this problem—we need to use some of our own concepts when understanding the territory, as we can’t see it directly—but I do think it’s possible to get a bit more objective than this. E.g., I consider thermodynamics/stat mech to be pretty centrally about abstractions, but it approaches them in a way that feels more “territory first,” if that makes any sense. Like, it didn’t start with the conclusion. It started with the observation that “heat moves stuff” and “what’s up with that?” and then eventually landed on an analysis of entropy involving macrostates. Somehow that progression feels more natural to me than starting with “dogs are things” and working backwards. E.g., I think I’m wanting something more like “if we understand these basic facts about the world, we can talk about dogs” rather than “if we start with dogs, we can talk sensibly about dogs.”
To be clear, I consider some of your work to be addressing this. E.g., I think the telephone theorem is a pretty important step in this direction. Much of the stuff about redundancy and modularity feels pretty tip-of-the-tongue onto something important, to me. But, at the very least, my goal with understanding abstractions is something like “how do we understand the world such that abstractions are natural kinds”? How do we find the joints such that, conditioning on those, there isn’t much room to vary? What are those joints like? The reason I like the telephone theorem is that it gives me one such handle: all else equal, information will dissipate quickly—anytime you see information persisting, it’s evidence of abstraction.
My own sense is that answering this question will have a lot more to do with how useful abstractions are than with how predictive/descriptive they are, which are related questions, but not quite the same. E.g., with the gears example you use to illustrate redundancy, I think the fact that we can predict almost everything about the gear from understanding a part of it is the same reason why the gear is useful. You don’t have to manipulate every atom in the gear to get it to move; you only have to press down on one of the… spokes(?), and the entire thing will turn. These are related properties. But they are not the same. E.g., you can think about the word “stop” as an abstraction in the sense that many sound waves map to the same “concept,” but that’s not very related to why the sound wave is so useful. It’s useful because it fits into the structure of the world: other minds will do things in response to it.
I want better ways to talk about how agents get work out of their environments by leveraging abstractions. I think this is the reason we ultimately care about them ourselves, and why AI will too. I also think it’s a big part of how we should be defining them—that the natural joint is less “what are the aggregate statistics of this set” and more “what does having this information allow us to do”?
I think it’s pretty unlikely that Anthropic’s murky strategy is good.
In particular, I think that balancing building AGI with building AGI safely only goes well for humanity in a pretty narrow range of worlds. Like, if safety is relatively easy and can roughly keep pace with capabilities, then I think this sort of thing might make sense. But the more the expected world departs from this—the more that you might expect safety to be way behind capabilities, and the more you might expect that it’s hard to notice just how big that gap is and/or how much of a threat capabilities pose—the more this strategy starts seeming pretty worrying to me.
It’s worrying because I don’t imagine Anthropic gets that many “shots” at playing safety cards, so to speak. Like, implementing RSPs and trying to influence norms is one thing, but what if they notice something actually-maybe-dangerous-but-they’re-not-sure as they’re building? Now they’re in this position where, if they want to be really careful (e.g., taking costly actions like stopping indefinitely until they’re convinced beyond reasonable doubt that it’s safe), they’re most likely kind of screwing their investors, and should probably expect to get less funding in the future. And the more likely it is, from their perspective, that the behavior in question does end up being a false alarm, the more pressure there is to not do due diligence.
But the problem is that the more ambiguous the situation is—the less we understand about these systems—the less sure we can be about whether any given behavior is or isn’t an indication of something pretty dangerous. And the current situation seems pretty ambiguous to me. I don’t think anyone knows, for instance, whether Claude 3 seeming to notice it’s being tested is something to worry about or not. Probably it isn’t. But really, how do we know? We’re going off of mostly behavioral cues and making educated guesses about what the behavior implies. But that really isn’t very reassuring when we’re building something much smarter than us, with potentially catastrophic consequences. As it stands, I don’t believe we can even assign numbers to things in a very meaningful sense, let alone assign confidence above a remotely acceptable threshold, i.e., some 9’s of assurance that what we’re about to embark on won’t kill everyone.
The combination of how much uncertainty there is in evaluating these systems and how much pressure there is for Anthropic to keep scaling seems very worrying to me. Like, if there’s a very obvious sign that a system is dangerous, then I believe Anthropic might be in a good position to pause and “sound the alarm.” But if things remain kind of ambiguous due to our lack of understanding, as they seem to me now, then I’m way less optimistic that the outcome of any maybe-dangerous-but-we’re-not-sure behavior is that Anthropic meaningfully and safely addresses it. In other words, I think that given our current state of understanding, the murky strategy favors “build AGI” more than it does “build AGI safely,” and that scares me.
I also think the prior should be quite strong, here, that the obvious incentives will have the obvious effects. Like, creating AGI is desirable (so long as it doesn’t kill everyone and so on). Not only on the “loads of money” axis, but also along other axes monkeys care about: prestige, status, influence, power, etc. Yes, practically no one wants to die, and I don’t doubt that many people at Anthropic genuinely care and are worried about this. But, also, it really seems like you should a priori expect that with stakes this high, cognition will get distorted around whether or not to pursue the stakes. Maybe all Anthropic staff are perfectly equipped to be epistemically sane in such an environment, but I don’t think that one should on priors expect it. People get convinced of all kinds of things when they have a lot to gain, or a lot to lose.
Anyway, it seems likely to me that we will continue to live in the world where we don’t understand these systems well enough to be confident in our evaluations of them, and I assign pretty significant probability to the worlds where capabilities far outstrip our alignment techniques, so I am currently not thrilled that Anthropic exists. I expect that their murky strategy is net bad for humanity, given how the landscape currently looks.
Maybe you really do need to iterate on frontier AI to do meaningful safety work.
This seems like an open question that, to my mind, Anthropic has not fully explored. One way that I sometimes think about this is to ask: if Anthropic were the only leading AI lab, with no possibility of anyone catching up any time soon, should they still be scaling as fast as they are? My guess is no. Like, of course the safety benefit to scaling is not zero. But it’s a question of whether the benefits outweigh the costs. Given how little we understand these systems, I’d be surprised if we were anywhere near hitting diminishing safety returns on current systems—as in, I don’t think the safety benefits of scaling vastly outstrip the benefit we might expect out of empirical work on current systems. And I think the potential cost of scaling as recklessly as we currently are is extinction. I don’t doubt that at some point scaling will be necessary and important for safety; I do doubt that the time for that is now.
Maybe you do need to stay on the frontier because the world is accelerating whether Anthropic wants it to or not.
It really feels like if you create an organization which, with some unsettlingly large probability, might directly lead to the extinction of humanity, then you’re doing something wrong. Especially so if the people you’re making the decisions for (i.e., everyone) would be—if they fully understood the risks involved—unhappy about it on reflection. Like, I’m pretty sure that the sentence from Anthropic’s pitch deck “these models could begin to automate large portions of the economy” is already enough for many people to be pretty upset. But if they learned that Anthropic also assigned ~33% to a “pessimistic world” which includes the possibility of “extinction,” then I expect most people would rightly be pretty furious. I think making decisions for people in a way that they would predictably be upset about is unethical, and it doesn’t make it okay just because other people would do it anyway.
In any case, I think that Anthropic’s existence has hastened race dynamics, and I think that makes our chances of survival lower. That seems pretty in line with what to expect from this kind of strategy (i.e., that it cashes out to scaling coming before safety where it’s non-obvious what to do), and I think it makes sense to expect things of this type going forward (e.g., I am personally pretty skeptical that Anthropic is going to ~meaningfully pause development unless it’s glaringly obvious that they should do so, at which point I think we’re clearly in a pretty bad situation). And although OpenAI was never claiming as much of a safety vibe as Anthropic currently is, I still think the track record of ambiguous strategies which play to both sides does not inspire that much confidence about Anthropic’s trajectory.
Does Dario-and-other-leadership have good models of x-risk?
I am worried about this. My read on the situation is that Dario is basically expecting something more like a tool than an agent. Broadly, I get this sense because when I model Anthropic as operating under the assumption that risks mostly stem from misuse, their actions make a lot more sense to me. But things like this quote also seem consistent with that: “I suspect that it may roughly work to think of the model as if it's trained in the normal way, just getting to above human level, it may be a reasonable assumption… that the internal structure of the model is not intentionally optimizing against us.” (Dario on the Dwarkesh podcast). If true, this makes me worried about the choices that Dario is going to make when, again, it’s not clear how to interpret the behavior of these systems. In particular, it makes me worried he’s going to err on the side of “this is probably fine,” since tools seem, all else equal, less dangerous than agents. Dario isn’t the only person Anthropic’s decisions depend on; still, I think his beliefs have a large bearing on what Anthropic does.
But, the way I wish the conversation was playing out was less like "did Anthropic say a particular misleading thing?"
I think it’s pretty important to call attention to misleading things. Both because there is some chance that public focus on inconsistencies might cause them to change their behavior, and because pointing out specific problems in public arenas often causes evidence to come forward in one common space, and then everyone can gain a clearer understanding of what’s going on.
Things like their RSP rely on being upheld in spirit, not only in letter.
This is something I’m worried about. I think that Anthropic’s current RSP is vague and/or undefined on many crucial points. For instance, I feel pretty nervous about Anthropic’s proposed response to an evaluation threshold triggering. One of the first steps is that they will “conduct a thorough analysis to determine whether the evaluation was overly conservative,” without describing what this “thorough analysis” is, nor who is doing it.
In other words, they will undertake some currently undefined process involving undefined people to decide whether it was a false alarm. Given how much is riding on this decision—like, you know, all of the potential profit they’d be losing if they concluded that the model was in fact dangerous—it seems pretty important to be clear about how these things will be resolved.
Instituting a policy like this is only helpful insomuch as it meaningfully constrains the company’s behavior. But when the responses to evaluations are this loosely and vaguely defined, it’s hard for me to trust that the RSP cashes out to more than a vague hope that Anthropic will be careful. It would be nice to feel like the Long Term Benefit Trust provided some kind of assurance against this. But even this seems difficult to trust when they’ve added “failsafe provisions” that allow a “sufficiently large” supermajority of stockholders to make changes to the Trust’s powers (without the Trustees’ consent), and without saying what counts as “sufficiently large.”
It seems plausible that this scenario could happen, i.e., that Anthropic and OpenAI end up in a stable two-player oligopoly. But I would still be pretty surprised if Anthropic's pitch to investors, when asking for billions of dollars in funding, is that they pre-commit to never release a substantially better product than their main competitor.
I agree that this is a plausible read of their pitch to investors, but I do think it’s a bit of a stretch to consider it the most likely explanation. It’s hard for me to believe that Anthropic would receive billions of dollars in funding if they're explicitly telling investors that they’re committing to only release equivalent or inferior products relative to their main competitor.
I assign >50% that Anthropic will at some point pause development for at least six months as a result of safety evaluations.
I believed, prior to the Claude 3 release, that Anthropic had committed to not meaningfully push the frontier.
I believed, prior to the Claude 3 release, that Anthropic had implied they were not going to meaningfully push the frontier.
I currently believe that Anthropic is planning to meaningfully push the frontier.
I currently believe that Anthropic previously committed to not meaningfully push the frontier.
I assign >10% that Anthropic will at some point completely halt development of AI, and attempt to persuade other organizations to as well (i.e., “sound the alarm”).