Comments
Ok, so it seems clear that we are, for better or worse, likely going to try to get AGI to do our alignment homework.
Who has thought through all the other homework we might give AGI that is as good of an idea, assuming a model that isn't an instant-game-over for us? E.G., I remember @Buck rattling off a list of other ideas that he had in his The Curve talk, but I feel like I haven't seen the list of, e.g., "here are all the ways I would like to run an automated counterintelligence sweep of my organization" ideas.
(Yes, obviously, if the AI is sneakily misaligned, you're just dead because it will trick you into firing all your researchers, etc.; this is written in a "playing to your outs" mentality, not an "I endorse this as a good plan" mentality.)
Huh? "fighting election misinformation" is not a sentence on this page as far as I can tell. And if you click through to the election page, you will see that the elections content is them praising a bipartisan bill backed by some of the biggest pro-Trump senators.
Without commenting on any strategic astronomy and neurology, it is worth noting that "bias", at least, is a major concern of the new administration (e.g., the Republican chair of the House Financial Services Committee is actually extremely worried about algorithmic bias being used for housing and financial discrimination and has given speeches about this).
I am not a fan, but it is worth noting that these are the issues that many politicians bring up already, if they're unfamiliar with the more catastrophic risks. The only one missing there is job loss. So while this choice by OpenAI sucks, it sort of usefully represents a social fact about the policy waters they swim in.
I am (sincerely!) glad that this is obvious to other people too and that they are talking about it already!
I mean, the literal best way to incentivize @Ricki Heicklen and me to do this again for LessOnline and Manifest 2025 is to create a prediction market on it, so I encourage you to do that
One point that maybe someone's made, but I haven't run across recently: if you want to turn AI development into a Manhattan Project, you will by-default face some real delays from the reorganization of private efforts into one big national effort. In a close race, you might actually see pressures not to do so, because you don't want to give up 6 months to a year on reorg drama -- so in some possible worlds, the Project is actually a deceleration move in the short term, even if it accelerates in the long term!
Ooh, interesting, thank you!
Incidentally, spurred by @Mo Putera's posting of Vernor Vinge's A Fire Upon The Deep annotations, I want to remind folks that Vinge's Rainbows End is very good and doesn't get enough attention, and will give you a less-incorrect understanding of how national security people think.
Oh, fair enough then, I trust your visibility into this. Nonetheless, one Can Just Report Bugs
Note for posterity that there has been at least $15K of donations since this got turned back on -- You Can Just Report Bugs
Ok, but you should leave the donation box up -- link now seems to not work? I bet there would be at least several $K USD of donations from folks who didn't remember to do it in time.
I think you're missing at least one strategy here. If we can get folks to agree that different societies can choose different combos, so long as they don't infringe on some subset of rights to protect other societies, then you could have different societies expand out into various pieces of the future in different ways. (Yes, I understand that's a big if, but it reduces the urgency/crux nature of value agreement).
Note that the production function of the 10x really matters. If it's "yeah, we get to net-10x if we have all our staff working alongside it," it's much more detectable than, "well, if we only let like 5 carefully-vetted staff in a SCIF know about it, we only get to 8.5x speedup".
(It's hard to prove that the results are from the speedup instead of just, like, "One day, Dario woke up from a dream with The Next Architecture in his head")
Basic clarifying question: does this imply, under the hood, some sort of diminishing returns curve, such that the lab pays for that labor until it reaches a net 10x faster improvement, but can't squeeze out much more?
And do you expect that's a roughly consistent multiplicative factor, independent of lab size? (I mean, I'm not sure lab size actually matters that much, to be fair, it seems that Anthropic keeps pace with OpenAI despite being smaller-ish)
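To gesture at what I mean by a diminishing-returns curve, here's a toy Python sketch -- the functional form and every number in it are made up, purely to illustrate a shape where marginal speedup flattens out around 10x:

```python
# Toy illustration (made-up functional form and numbers): AI "research labor"
# spend buys speedup with diminishing returns, saturating near a ~10x cap.
import math

def speedup(spend: float, cap: float = 10.0, scale: float = 3.0) -> float:
    """Hypothetical speedup multiplier as a function of spend (arbitrary units)."""
    return 1 + (cap - 1) * (1 - math.exp(-spend / scale))

for spend in [1, 3, 10, 30, 100]:
    print(f"spend={spend:>3}  speedup={speedup(spend):.2f}x")
# Doubling spend past a certain point barely moves the multiplier -- that's
# the "pays for labor up to ~10x but can't squeeze out much more" picture.
```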
For the record: signed up for a monthly donation starting in Jan 2025. It's smaller than I'd like given some financial conservatism until I fill out my taxes; I may revisit it later.
Everyone who's telling you there aren't spoilers in here is well-meaning, but wrong. But to justify why I'm saying that is also spoilery, so to some degree you have to take this on faith.
(Rot13'd for those curious about my justification: Bar bs gur znwbe cbvagf bs gur jubyr svp vf gung crbcyr pna, vs fhssvpvragyl zbgvingrq, vasre sne zber sebz n srj vfbyngrq ovgf bs vasbezngvba guna lbh jbhyq anviryl cerqvpg. Vs lbh ner gryyvat Ryv gung gurfr ner abg fcbvyref V cbyvgryl fhttrfg gung V cerqvpg Nfzbqvn naq Xbein naq Pnevffn jbhyq fnl lbh ner jebat.)
Opportunities that I'm pretty sure are good moves for Anthropic generally:
- Open an office literally in Washington, DC, that does the same work that any other Anthropic office does (i.e., NOT purely focused on policy/lobbying, though I'm sure you'd have some folks there who do that). If you think you're plausibly going to need to convince policymakers on critical safety issues, having nonzero numbers of your staff who are definitively not lobbyists be the drinking buddies or climbing-gym buddies who get called on the "My boss needs an opinion on this bill amendment by tomorrow, what do you think" roster is much more important than your org currently seems to think!
- Expand on recent efforts to put more employees (and external collaborators on research) in front of cameras as the "face" of that research -- you folks frankly tend to talk in ways that are compatible with national security policymakers' vibes. (e.g., Evan and @Zac Hatfield-Dodds both have a flavor of the playful gallows humor that pervades that world). I know I'm a broken record on this but I do think it would help.
- Do more to show how the RSP affects your daily work (unlike many on this forum, I currently believe that you are actually Trying to Use The Policy and had many line edits as a result of wrestling with v1.0's minor infelicities). I understand that it is very hard to explain specific scenarios of how it's impacted day-to-day work without leaking sensitive IP or pointing people in the direction of potentially-dangerous things. Nonetheless, I think Anthropic needs to try harder here. It's, like...it's like trying to understand DoD if they only ever talked about the "warfighter" in the most abstract terms and never, like, let journalists embed with a patrol on the street in Kabul or Baghdad.
- Invest more in DC policymaker education outside of the natsec/defense worlds you're engaging already -- I can't emphasize enough how many folks in broad DC think that AI is just still a scam or a fad or just "trying to destroy art". On the other hand, people really have trouble believing that an AI could be "as creative as" a human -- the sort of Star Trek-ish "Kirk can always outsmart the machine" mindset pervades pretty broadly. You want to incept policymaking elites more broadly so that they are ready as this scales up.
Opportunities that I feel less certain about, but in the spirit of brainstorming:
- Develop more proactive, outward-facing detection capabilities to see if there are bad AI models out there. I don't mean red-teaming others' models, or evals, or that sort of thing. I mean, think about how you would detect if Anthropic had bad (misaligned or aligned-but-being-used-for-very-impactful-bad-things) models out there if you were at an intelligence agency without official access to Anthropic's models and then deploy those capabilities against Anthropic, and the world broadly.[1] You might argue that this is sort of an inverted version of @Buck's control agenda -- instead of trying to make it difficult for a model to escape, think about what facts about the world are likely to be true if a model has escaped, and then go looking for those.
- If it's not already happening, have Dario and other senior Anthropic leaders meet with folks who had to balance counterintelligence paranoia with operational excellence (e.g., leaders of intelligence agencies, for whom the standard advice to their successor is, "before you go home every day, ask 'where's the spy[2]'") so that they have a mindset for how to scale up their paranoia over time as needed
- Something something use cases -- use-case-based restrictions are popular in some policy spheres. Some sort of research demonstrating that a model that's designed for and safe for use case X can easily be turned into a misaligned tool for use case Y under a plausible usage scenario might be useful?
Reminder/disclosure: as someone who works in AI policy, there are worlds where some of these ideas help my self-interest; others harm it. I'm not going to try to do the math on which are which under all sorts of complicated double-bankshot scenarios, though.
FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the "we need more people working on this, make it happen" button.
In an ideal world (perhaps not reasonable given your scale), you would have some sort of permissions and logging against some sensitive types of queries on DM metadata. (E.g., perhaps you would let any Lighthaven team member see the aggregate number "rate of DMs from accounts <1 month in age compared to historic baseline" on the dashboard, but "how many DMs has Bob (an account over 90 days old) sent to Alice" would require more guardrails.)
Edit: to be clear, I am comfortable with you doing this without such logging at your current scale and think it is reasonable to do so.
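To make the tiered-permissions idea a bit more concrete, here's a minimal Python sketch -- every name, query type, and threshold below is hypothetical, not a real LessWrong/Lighthaven API:

```python
# Minimal sketch of tiered access to DM-metadata queries. All names and
# thresholds here are hypothetical, not a real Lighthaven/LessWrong system.
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dm_metadata_audit")

@dataclass
class Query:
    kind: str       # e.g. "aggregate_new_account_dm_rate" or "pairwise_dm_count"
    requester: str  # which team member is asking

# Aggregate stats any team member can see on a dashboard.
LOW_SENSITIVITY = {"aggregate_new_account_dm_rate"}

def execute(query: Query):
    """Placeholder for the actual metadata lookup."""
    ...

def run_query(query: Query, has_elevated_approval: bool = False):
    if query.kind in LOW_SENSITIVITY:
        log.info("low-sensitivity query %s by %s", query.kind, query.requester)
        return execute(query)
    # Sensitive queries (e.g. "how many DMs has Bob sent Alice?") need an
    # extra guardrail plus a louder audit-log entry.
    if not has_elevated_approval:
        raise PermissionError(f"{query.kind} requires elevated approval")
    log.warning("SENSITIVE query %s by %s (approved)", query.kind, query.requester)
    return execute(query)
```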
I have a few weeks off coming up shortly, and I'm planning on spending some of it monkeying around with AI and code stuff. I can think of two obvious tacks: 1. go do some fundamentals learning on technical stuff I don't have hands-on experience with, or 2. go build new fun stuff.
Does anyone have particular lists of learning topics / syllabi / similar things that would be a good fit for a "fairly familiar with the broad policy/technical space, but his largest shipped chunk of code is a few hundred lines of Python" person like me?
Note also that this work isn't just papers; e.g., as a matter of public record MIRI has submitted formal comments to regulators to inform draft regulation based on this work.
(For those less familiar, yes, such comments are indeed actually weirdly impactful in the American regulatory system).
In a hypothetical, bad future where we have to do VaccinateCA 2.0 against e.g. bird flu, I personally wonder if "aggressively help people source air filters" would be a pre-vaccine-distribution-time step we would consider. (Not canon! Might be very wrong! Just idle musing)
Also, I would generally volunteer to help with selling Lighthaven as an event venue to boring consultant things that will give you piles of money, and IIRC Patrick Ward is interested in this as well, so please let us know how we can help.
I am excited about this on the grounds of "we deserve to have nice things," though for boring financial planning reasons I am not sure whether I will donate additional funds prior to calendar year end or in calendar year 2025.
(Note that I made a similar statement in the past and then donated $100 to Lighthaven very shortly thereafter, so, like, don't attempt to reverse-engineer my financial status from this or whatever.)
I think I'm also learning that people are way more interested in this detail than I expected!
I debated changing it to "203X" when posting to avoid this becoming the focus of the discussion but figured, "eh, keep it as I actually wrote it in the workshop" for good epistemic hygiene.
Oh, it very possibly is the wrongest part of the piece! I put it in the original workshop draft as I was running out of time and wanted to provoke debate.
A brief gesture at a sketch of the intuition: imagine a different, crueler world, where there were orders of magnitude more nation-states, but at the start only a few nuclear powers, like in our world, with a 1950s-level tech base. If the few nuclear powers want to keep control, they'll have to divert huge chunks of their breeder reactors' output to pre-emptively nuking any site in the many, many non-nuclear-club states that could be an arms program, in order to prevent breakouts; as a result, any of the nuclear powers would have to wait a fairly long time to assemble an arms stockpile sufficient to launch a Project Orion into space.
As you know, I have huge respect for USG natsec folks. But there are (at least!) two flavors of them: 1) the cautious, measure-twice-cut-once sort that have carefully managed deterrence for decades, and 2) the "fuck you, I'm doing Iran-Contra" folks. Which do you expect will end up in control of such a program? It's not immediately clear to me which ones would.
I think this is a (c) leaning (b), especially given that we're doing it in public. Remember, the Manhattan Project was a highly-classified effort and we know it by an innocuous name given to it to avoid attention.
Saying publicly, "yo, China, we view this as an all-costs priority, hbu" is a great way to trigger a race with China...
But if it turned out that we knew from ironclad intel with perfect sourcing that China was already racing (I don't expect this to be the case), then I would lean back more towards (c).
I'll be in Berkeley Weds evening through next Monday, would love to chat with, well, basically anyone who wants to chat. (I'll be at The Curve Fri-Sun, so if you're already gonna be there, come find me there between the raindrops!)
Thanks, looking forward to it! Please do let us folks who worked on A Narrow Path (especially me, @Tolga , and @Andrea_Miotti ) know if we can be helpful in bouncing around ideas as you work on the treaty proposal!
Is there a longer-form version with draft treaty language (even an outline)? I'd be curious to read it.
I think people opposing this have a belief that the counterfactual is "USG doesn't have LLMs" instead of "USG spins up its own LLM development effort using the NSA's no-doubt-substantial GPU clusters".
Needless to say, I think the latter is far more likely.
I think the thing that you're not considering is that when tunnels are more prevalent and more densely packed, the incentives to use the defensive strategy of "dig a tunnel, then set off a very big bomb in it that collapses many tunnels" gets far higher. It wouldn't always be infantry combat, it would often be a subterranean equivalent of indirect fires.
Ok, so Anthropic's new policy post (explicitly NOT linkposting it properly since I assume @Zac Hatfield-Dodds or @Evan Hubinger or someone else from Anthropic will, and figure the main convo should happen there, and don't want to incentivize fragmenting of conversation) seems to have a very obvious implication.
Unrelated, I just slammed a big AGI-by-2028 order on Manifold Markets.
Yup. The fact that the profession that writes the news sees "I should resign in protest" as their own responsibility in this circumstance really reveals something.
At LessOnline, there was a big discussion one night around the picnic tables with @Eliezer_Yudkowsky, @habryka, and some interlocutors from the frontier labs (you'll momentarily see why I'm being vague on the latter names).
One question was: "does DC actually listen to whistleblowers?" and I contributed that, in fact, DC does indeed have a script for this, and resigning in protest is a key part of it, especially ever since the Nixon years.
Here is a usefully publicly-shareable anecdote on how strongly this norm is embedded in national security decision-making, from the New Yorker article "The U.S. Spies Who Sound the Alarm About Election Interference" by David Kirkpatrick, Oct 21, 2024:
(https://archive.ph/8Nkx5)
The experts’ chair insisted that in this cycle the intelligence agencies had not withheld information “that met all five of the criteria”—and did not risk exposing sources and methods. Nor had the leaders’ group ever overruled a recommendation by the career experts. And if they did? It would be the job of the chair of the experts’ group to stand up or speak out, she told me: “That is why we pick a career civil servant who is retirement-eligible.” In other words, she can resign in protest.
Does "highest status" here mean highest expertise in a domain generally agreed by people in that domain, and/or education level, and/or privileged schools, and/or from more economically powerful countries etc?
I mean, functionally all of those things. (Well, minus the country dynamic. Everyone at this event I talked to was US, UK, or Canadian, which is all sorta one team for purposes of status dynamics at that event)
I was being intentionally broad, here. I am probably less interested for purposes of this particular post only in the question of "who controls the future" swerves and more about "what else would interested, agentic actors do" questions.
It is not at all clear to me that OpenPhil is the only org who feels this way -- I can think of several non-EA-ish charities that if they genuinely 100% believed "none of the people you care for will die of the evils you fight if you can just keep them alive for the next 90 days" would plausibly do some interestingly agentic stuff.
Oh, to be clear I'm not sure this is at all actually likely, but I was curious if anyone had explored the possibility conditional on it being likely
Basic Q: has anyone written much down about what sorts of endgame strategies you'd see just-before-ASI from the perspective of "it's about to go well, and we want to maximize the benefits of it" ?
For example: if we saw OpenPhil suddenly make a massive push to just mitigate mortality at the cost of literally every other development goal they have, I might suspect that they suspect that we're about to all be immortal under ASI, and they're trying to get as many people possible to that future...
yup, as @sanxiyn says, this already exists. Their example is, AIUI, a high-end research one; an actually-on-your-laptop-right-now, but admittedly more narrow example is address space layout randomization.
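(If you want to see the on-your-laptop version for yourself: assuming CPython -- where id() is just an object's memory address -- on an OS with ASLR enabled, running the snippet below a few times should print different addresses on each run, because the process's memory layout gets randomized.)

```python
# Tiny ASLR demo: addresses of freshly allocated objects/buffers shift between
# separate runs of the interpreter when the OS randomizes memory layout.
import ctypes

buf = ctypes.create_string_buffer(16)
print("python object address:", hex(id(object())))
print("ctypes buffer address:", hex(ctypes.addressof(buf)))
```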
Wild speculation: they also have a sort of we're-watching-but-unsure provision about cyber operations capability in their most recent RSP update. In it, they say in part that "it is also possible that by the time these capabilities are reached, there will be evidence that such a standard is not necessary (for example, because of the potential use of similar capabilities for defensive purposes)." Perhaps they're thinking that automated vulnerability discovery is at least plausibly on-net-defensive-balance-favorable*, and so they aren't sure it should be regulated as closely, even if it is still "dual use" in some informal sense?
Again, WILD speculation here.
*A claim that is clearly seen as plausible by, e.g., the DARPA AI Grand Challenge effort.
It seems like the current meta is to write a big essay outlining your opinions about AI (see, e.g., Gladstone Report, Situational Awareness, various essays recently by Sam Altman and Dario Amodei, even the A Narrow Path report I co-authored).
Why do we think this is the case?
I can imagine at least 3 hypotheses:
1. Just path-dependence; someone did it, it went well, others imitated
2. Essays are High Status Serious Writing, and people want to obtain that trophy for their ideas
3. This is a return to the true original meaning of an essay, under Montaigne, that it's an attempt to write thinking down when it's still inchoate, in an effort to make it more comprehensible not only to others but also to oneself. And AGI/ASI is deeply uncertain, so the essay format is particularly suited for this.
What do you think?
Okay, I spent much more time with the Anthropic RSP revisions today. Overall, I think it has two big thematic shifts for me:
1. It's way more "professionally paranoid," but it needs even more of that on non-cyber risks. A good start, but it needs more on being able to stop human intelligence (i.e., good old-fashioned spies)
2. It really has an aggressively strong vibe of "we are actually using this policy, and We Have Many Line Edits As A Result." You may not think that RSPs are sufficient -- I'm not sure I do, necessarily -- but I am heartened slightly that they genuinely seem to take the RSP seriously to the point of having mildly-frustrated-about-process-hiccup footnotes about it. (Free advice to Anthropic PR: interview a bunch of staff about this on camera, cut it together, and post it; it will be lovely and humanizing and great recruitment material, I bet).
It's a small but positive sign that Anthropic sees taking 3 days beyond their RSP's specified timeframe to conduct a process without a formal exception as an issue. Signals that at least some members of the team there are extremely attuned to normalization of deviance concerns.
I once saw a video on Instagram of a psychiatrist recommending to other psychiatrists that they purchase ear scopes to check out their patients' ears, because:
1. Apparently it is very common for folks with severe mental health issues to imagine that there is something in their ear (e.g., a bug, a listening device)
2. Doctors usually just say "you are wrong, there's nothing in your ear" without looking
3. This destroys trust, so he started doing cursory checks with an ear scope
4. Far more often than he expected (I forget exactly, but something like 10-20%ish), there actually was something in the person's ear -- usually just earwax buildup, but occasionally something else like a dead insect -- that was indeed causing the sensation, and he gained a clinical pathway to addressing his patients' discomfort that he had previously lacked
Looking forward to it! (Should rules permit, we're also happy to discuss privately at an earlier date)
Has anyone thought about the things that governments are uniquely good at when it comes to evaluating models?
Here are at least 3 things I think they have as benefits:
1. Just an independent 3rd-party perspective generally
2. The ability to draw insights across multiple labs' efforts, and identify patterns that others might not be able to
3. The ability to draw on classified threat intelligence to inform its research (e.g., Country X is using model Y for bad behavior Z) and to test the model for classified capabilities (bright line example: "can you design an accurate classified nuclear explosive lensing arrangement").
Are there others that come to mind?
I think this can be true, but I don't think it needs to be true:
"I expect that a lot of regulation about what you can and can’t do stops being enforceable once the development is happening in the context of the government performing it."
I suspect that if the government is running the at-all-costs-top-national-priority Project, you will see some regulations stop being enforceable. However, we also live in a world where you can easily find many instances of government officials complaining in their memoirs that laws and regulations prevented them from being able to go as fast or as freely as they'd want on top-priority national security issues. (For example, DoD officials even after 9-11 famously complained that "the lawyers" restricted them too much on top-priority counterterrorism stuff.)