It's a small but positive sign that Anthropic treats taking 3 days beyond their RSP's specified timeframe to conduct a process, without a formal exception, as an issue. It signals that at least some members of the team there are extremely attuned to normalization-of-deviance concerns.
I once saw a video on Instagram of a psychiatrist recommending to other psychiatrists that they purchase ear scopes to check out their patients' ears, because:
1. Apparently it is very common for folks with severe mental health issues to imagine that there is something in their ear (e.g., a bug, a listening device)
2. Doctors usually just say "you are wrong, there's nothing in your ear" without looking
3. This destroys trust, so he started doing cursory checks with an ear scope
4. Far more often than he expected (I forget exactly, but something like 10-20%ish), there actually was something in the person's ear -- usually just earwax buildup, but occasionally something else like a dead insect -- that was indeed causing the sensation. This gave him a clinical pathway to addressing his patients' discomfort that he had previously lacked.
Looking forward to it! (Should rules permit, we're also happy to discuss privately at an earlier date)
Has anyone thought about the things that governments are uniquely good at when it comes to evaluating models?
Here are at least 3 things I think they have as benefits:
1. Just an independent 3rd-party perspective generally
2. The ability to draw insights across multiple labs' efforts, and identify patterns that others might not be able to
3. The ability to draw on classified threat intelligence to inform their research (e.g., Country X is using model Y for bad behavior Z) and to test the model for classified capabilities (bright-line example: "can you design an accurate classified nuclear explosive lensing arrangement").
Are there others that come to mind?
I think this can be true, but I don't think it needs to be true:
"I expect that a lot of regulation about what you can and can’t do stops being enforceable once the development is happening in the context of the government performing it."
I suspect that if the government is running the at-all-costs-top-national-priority Project, you will see some regulations stop being enforceable. However, we also live in a world where you can easily find many instances of government officials complaining in their memoirs that laws and regulations prevented them from being able to go as fast or as freely as they'd want on top-priority national security issues. (For example, DoD officials even after 9-11 famously complained that "the lawyers" restricted them too much on top-priority counterterrorism stuff.)
Gentlemen, it's been a pleasure playing with you tonight
I suspect this won't get published until November at the earliest, but I am already delightfully pleased with this bit:
Canada geese fly overhead, honking. Your inner northeast Ohioan notices that you are confused; it’s the wrong season for them to migrate this far south, and they’re flying westwards, anyways.
A quick Google discovers that some Canada geese have now established themselves non-migratorily in the Bay Area:
"The Migratory Bird Treaty Act of 1918 banned hunting or the taking of eggs without a permit. These protections, combined with an increase in desirable real estate—parks, golf course and the like—spurred a dramatic turnaround for the species. Canada geese began breeding in the Bay Area—the southern end of their range – in the late 1950s."
You nod, approvingly; this clearly is another part of the East Bay’s well-known, long-term philanthropic commitment to mitigating Acher-Risks.
Preregistering intent to write "Every Bay Area Walled Compound" (hat tip: Emma Liddell)
Unrelatedly, I am in Berkeley through Wednesday afternoon, let me know if you're around and would like to chat
Generally, it is difficult to overstate how completely the PRC is seen as a bad-faith actor in DC these days. Many folks saw them engage in mass economic espionage for a decade while repeatedly promising to stop; those folks, for whom those were formative moments, are now more senior in their careers. Then COVID happened, and while not everyone believes in the lab leak hypothesis, basically everyone believes that the PRC sure as heck reflexively covered up, whether or not they were actually culpable.
(Edit: to be clear, reporting, not endorsing, these claims)
Basic question because I haven't thought about this deeply: in national security stuff, we often intentionally elide the difference between capabilities and intentions. The logic is: you can't assume a capability won't be used, so you should plan as if it is intended to be used.
Should we adopt such a rule for AGI with regards to policy decision-making? (My guess is...probably not for threat assessment but probably yes for contingency planning?)
I think, having been raised in a series of very debate- and seminar-centric discussion cultures, that a quick-hit question like that is indeed contributing something of substance. I think it's fair that folks disagree, and I think it's also fair that people signal (e.g., with karma) that they think "hey man, let's go a little less Socratic in our inquiry mode here."
But, put in more rationalist-centric terms, sometimes the most useful Bayesian update you can offer someone else is, "I do not think everyone is having the same reaction to your argument that you expected." (Also true for others doing that to me!)
(Edit to add two words to avoid ambiguity in meaning of my last sentence)
Yes, I would agree that if I expected a short take to have this degree of attention, I would probably have written a longer comment.
Well, no, I take that back. I probably wouldn't have written anything at all. To some, that might be a feature; to me, that's a bug.
It is genuinely a sign that we are all very bad at predicting others' minds that it didn't occur to me that saying, effectively, "OP asked for 'takes', here's a take on why I think this is pragmatically a bad idea" would also mean that I was saying "and therefore there is no other good question here". That's, as the meme goes, a whole different sentence.
I think it's bad for discourse for us to pretend that discourse doesn't have impacts on others in a democratic society. And I think the meta-censoring of discourse by claiming that certain questions might have implicit censorship impacts is one of the most anti-rationality trends in the rationalist sphere.
I recognize most users of this platform will likely disagree, and predict negative agreement-karma on this post.
Ok, then to ask it again in your preferred question format: is this where we think our getting-potential-employees-of-Anthropic-to-consider-the-value-of-working-on-safety-at-Anthropic points are best spent?
Is this where we think our pressuring-Anthropic points are best spent?
I personally endorse this as an example of us being a community that Has The Will To Try To Build Nice Things.
To say the obvious thing: I think if Anthropic isn't able to make at least somewhat-roughly-meaningful predictions about AI welfare, then their core current public research agendas have failed?
Possibly misguided question given the context -- I see you incorporating imperfect information in "the attack fails silently"; why not also a distinction between "the attack succeeds noisily, the AI wins and we know it won" and "the attack succeeds silently, the AI wins and we don't know it won"?
I would suggest that the set of means available to nation-states to unilaterally surveil another nation state is far more expansive than the list you have. For example, the good-old-fashioned "Paying two hundred and eighty-two thousand dollars in a Grand Cayman banking account to a Chinese bureaucrat"* appears nowhere in your list.
*If you get that this is a reference to the movie Spy Game, you are cool. If you don't, go watch Spy Game. It has a worldview on power that is extremely relevant to rationalists.
I think you could argue plausibly that the climax of Vernor Vinge's A Deepness in the Sky has aspects of this, though it's subverted in multiple interesting spoilery ways.
In fact, I think you could argue that a lot of Vinge's writing tends to have major climaxes dependent on Xanatos Gambit pileups based on deception themes.
This feels like a great theory for one motivation, but it isn't at all complete.
For example: this theory doesn't really predict why anyone is ever hired above the bottom level of an organization at the margin.
That's a fair criticism! Season 1 is definitely slower on that front compared to the others. I think season 1 is the most normal "crime of the week" season by far, which is why I view it as a good on-ramp for folks less familiar. Arguably, for someone situated as you are, you should just watch the pilot, read a quick wiki summary of every other episode in season 1 except for the last 2, watch those last 2, and move into season 2 when things get moving a little faster. (Finch needs a character that doesn't appear until season 2 to do a lot of useful exposition on how he thinks about the Machine's alignment).
I will continue to pitch on the idea that Person of Interest is a TV show chock full of extremely popular TV people, including one lead beloved by Republicans, and we inexplicably fail to push people towards its Actually Did The Research presentation of loss-of-control stuff.
We should do that. You all should, unironically, recommend it, streaming now on Amazon Prime for free, to your normie parents and aunts and uncles if they're curious about what you do at work.
Why are we so much more worried about LLMs having CBRN risk than super-radicalization risk, precisely?
(Or is this just an expected-harm metric rather than a probability metric?)
I think you're eliding the difference between "powerful capabilities" being developed, the window of risk, and the best solution.
For example, if Anthropic believes "_we_ will have it internally in 1-3 years, but no small labs will, and we can contain it internally" then they might conclude that the warrant for a state-level FMD is low. Alternatively, you might conclude, "we will have it internally in 1-3 years, other small labs will be close behind, and an American state's capabilities won't be sufficient, we need DoD, FBI, and IC authorities to go stompy on this threat", and thus think a state-level FMD is low-value-add.
Very unsure I agree with either of these hypos, to be clear! Just trying to explore the possibility space and point out that this is complex.
I am (speaking personally) pleasantly surprised by Anthropic's letter. https://cdn.sanity.io/files/4zrzovbb/website/6a3b14a98a781a6b69b9a3c5b65da26a44ecddc6.pdf
I think at the meta level I very much doubt that I am responsible enough to create and curate a list of human beings for the most dangerous hazards. For example, I am very confident that I could not 100% successfully detect a foreign government spy inside my friend group, because even the US intelligence community can't do that... you need other mitigating controls, instead.
Yeah, that's a useful taxonomy to be reminded of. I think it's interesting how the "development hazard", item 8, with maybe a smidge of "adversary hazard", is the driver of people's thinking on AI. I'm pretty unconvinced that good infohazard doctrine, even for AI, can be written based on thinking mainly about that!
I think a lot of my underlying instinctive opposition to this concept boils down to thinking that we can and do coordinate on this stuff quite a lot. Arguably, AI is the weird counterexample of a thought that wants to be thunk -- I think modern Western society is very nearly tailor-made to seek a thing that is abstract, maximizing, systematizing of knowledge, and useful, especially if it fills a hole left by the collapse of organized religion.
I think for most other infohazards, the proper approach requires setting up an (often-government) team that handles them, which requires those employees to expose themselves to the infohazard to manage it. And, yeah, sometimes they suffer real damage from it. There's no way to analyze ISIS beheading videos to stop their perpetrators without seeing some beheading videos; I think that's the more-common varietal of infohazard I'm thinking of.
I think this is plausibly describing some folks!
But I also think there's a separate piece -- I observe, with pretty high odds that it isn't just an act, that at least some people are trying to associate themselves with the near-term harms and AI ethics stuff because they think that is the higher-status stuff, despite direct obvious evidence that the highest-status people in the room disagree.
I'm pretty sure that I think "infohazard" is a conceptual dead end that embeds some really false understandings of how secrets are used by humans. It is an orphan of a concept -- it doesn't go anywhere. Ok, the information's harmful. You need humans to touch that info anyways to do responsible risk-mitigation. So now what?
I just want to note that people who've never worked in a true high-confidentiality environment (professional services, national defense, professional services for national defense) probably radically underestimate the level of brain damage and friction that Zac is describing here:
"Imagine, if you will, trying to hold a long conversation about AI risk - but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around."
Confidentiality is really, really hard to maintain. Doing so while also engaging the public is terrifying. I really admire the frontier labs folks who try to engage publicly despite that quite severe constraint, and really worry a lot as a policy guy about the incentives we're creating to make that even less likely in the future.
One question I have is whether Nancy Pelosi was asked and agreed to do this, or whether Nancy Pelosi identified this proactively as an opportunity to try to win back some tech folks to the Dem side. That substantially changes our estimate of how much influence the labs have in this conversation.
I mean, I think it's worth doing an initial loose and qualitative discussion to make sure that you're thinking about overlapping spaces conceptually. Otherwise, not worth the more detailed effort.
Unsolicited suggestion: it is probably useful for y'all to define further what "pick a lock" means -- e.g., if someone builds a custom defeat device that does something non-destructive but engages in a mechanical operation very surprising to someone thinking of traditional lock-picking methods, does that count?
(I think you'd probably say yes, so long as the device isn't, e.g., a robot arm that's nondestructively grabbing the master key for the lock out of Zac's pocket and inserting it into the lock, but some sort of defining-in-advance would likely help.)
Nonetheless, I think this would be awesome as an open challenge at Defcon (I suspect you can convince them to Black Badge the challenge...)
Thanks, this is helpful! I also think this helps me understand a lot better what is intended to be different about @Buck 's research agenda from others, that I didn't understand previously.
So, I really, really am not trying to be snarky here but am worried this comment will come across this way regardless. I think this is actually quite important as a core factual question given that you've been around this community for a while, and I'm asking you in your capacity as "person who's been around for a minute". It's non-hyperbolically true that no one has published this sort of list before in this community?
I'm asking, because if that's the case, someone should, e.g., just write a series of posts that just marches through US government best-practices documents on these domains (e.g., Chemical Safety Board, DoD NISPOM, etc.) and draws out conclusions on AI policy.
I think I agree with much-to-all of this. One further amplification I'd make about the last point: the culture of DC policymaking is one where people are expected to be quick studies and it's OK to be new to a topic; talent is much more funged from topic to topic in response to changing priorities than you'd expect. Your LessWrong-informed outside view of how much you need to know on a topic to start commenting on policy ideas is probably wrong.
(Yes, I know, someone is about to say "but what if you are WRONG about the big idea given weird corner case X or second-order effects Y?" Look, reversed stupidity is not wisdom, but also, sometimes you can just quickly identify stupid-across-almost-all-possible-worlds ideas and convince people just not to do them rather than having to advocate for an explicit good-idea alternative.)
Everyone assumes that it was named after Claude Shannon, but it appears they've never actually confirmed that.
(This is not an endorsement of Jim Caviezel's beliefs, in case anyone somehow missed my point here.)
I feel like one of the trivially most obvious signs that AI safety comms hasn't gone actually mainstream yet is that we don't say, "yeah, superintelligent AI is very risky. No, I don't mean Terminator. I'm thinking more Person of Interest, you know, that show with the guy from the Sound of Freedom and the other guy who was on Lost and Evil?"
Yup! I think those are potentially very plausible, and similar things were on my short list of possible explanations. I would be very not shocked if those are the true reasons. I just don't think I have anywhere near enough evidence yet to actually conclude that, so I'm just reporting the random observation for now :)
A random observation from a think tank event last night in DC -- the average person in those rooms is convinced there's a problem, but that it's the near-term harms, the AI ethics stuff, etc. The highest-status and highest-rank people in those rooms seem to be much more concerned about catastrophic harms.
This is a very weird set of selection effects. I'm not sure what to make of it, honestly.
One item that I think I see missing from this list is what you might call "ritual" -- agreed-upon ways of knowing what to do in a given context that two members of an organization can have shared mental models of, whether or not they've worked together in the past. This allows you to scale trust by reducing the amount of trust needed to handle the same amount of ambiguity, at some loss of flexibility.
For example, when I was at McKinsey, calling a meeting a "problem solving" as opposed to a "status update" or a "steerco" would invoke three distinct sets of behaviors and expectations. As a result, each participant had some sense of what other participants might by-default expect the meeting to feel like and be, and so even if participants hadn't worked much with each other in the past, they could know how to act in a trust-building way in that meeting context. The flip side is that if the meeting needed something very different from the normal behaviors, it became slightly harder to break out of the default mode.
I would politely, but urgently, suggest that if you're thinking a lot about scenarios where you could justify suicide, you might not be as interested in the scenarios as in the permission you think they might give you. And you might not realize that! Motivated reasoning is a powerful force for folks who are feeling some mental troubles.
This is the sort of thing where checking in with a loved one generally about how they perceive your general affect and mood is a really good idea. I urge you to do that. You're probably fine and just playing with some abstract ideas, but why not check in with a loved one just in case?
One of the things I greatly enjoyed about this writeup is that it reminded me how much the "empty-plate" vibe was lovely and something I want to try to create more of in my own day-to-day.
Tangible specific action: I have been raving about how much I loved the Lighthaven supply cabinets. I literally just now purchased a set of organizers shaped for my own bookcases to be able to recreate a similar thing in my own home; thank you for your reminder that caused me to do this.
I would like to politely request that if you happen to have a chance to tell Leo's owner that Leo is clearly a very happy dog that feels loved, could you please do so on my behalf?