Posts

Towards Quantitative AI Risk Management 2024-10-16T19:26:48.817Z
simeon_c's Shortform 2024-04-04T09:01:48.921Z
Forecasting future gains due to post-training enhancements 2024-03-08T02:11:57.228Z
Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis 2024-02-01T21:30:44.090Z
A Brief Assessment of OpenAI's Preparedness Framework & Some Suggestions for Improvement 2024-01-22T20:08:57.250Z
Responsible Scaling Policies Are Risk Management Done Wrong 2023-10-25T23:46:34.247Z
Do LLMs Implement NLP Algorithms for Better Next Token Predictions? 2023-09-19T12:28:45.660Z
In the Short-Term, Why Couldn't You Just RLHF-out Instrumental Convergence? 2023-09-16T10:44:46.459Z
AGI x Animal Welfare: A High-EV Outreach Opportunity? 2023-06-28T20:44:25.836Z
The Cruel Trade-Off Between AI Misuse and AI X-risk Concerns 2023-04-22T13:49:02.124Z
AI Takeover Scenario with Scaled LLMs 2023-04-16T23:28:14.004Z
Navigating AI Risks (NAIR) #1: Slowing Down AI 2023-04-14T14:35:40.395Z
Request to AGI organizations: Share your views on pausing AI progress 2023-04-11T17:30:46.707Z
Could Simulating an AGI Taking Over the World Actually Lead to a LLM Taking Over the World? 2023-01-13T06:33:35.860Z
[Linkpost] DreamerV3: A General RL Architecture 2023-01-12T03:55:29.931Z
Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers? 2022-12-31T11:34:18.185Z
AGI Timelines in Governance: Different Strategies for Different Timeframes 2022-12-19T21:31:25.746Z
Extracting and Evaluating Causal Direction in LLMs' Activations 2022-12-14T14:33:05.607Z
Is GPT3 a Good Rationalist? - InstructGPT3 [2/2] 2022-04-07T13:46:58.255Z
New GPT3 Impressive Capabilities - InstructGPT3 [1/2] 2022-03-13T10:58:46.326Z

Comments

Comment by simeon_c (WayZ) on Common misconceptions about OpenAI · 2024-12-07T22:33:45.018Z · LW · GW

250 upvotes is also crazy high. Another sign of how poor the EA/LessWrong communities are at character judgment.

The same thing is happening right now before our eyes with Anthropic, and similar crowds are just as confidently asserting that this time they're really the good guys.

Comment by simeon_c (WayZ) on Should there be just one western AGI project? · 2024-12-04T20:18:48.060Z · LW · GW

I only skimmed, but I wanted to flag that I like Bengio's proposal of a single coalition that develops several AGIs in a coordinated fashion (e.g. simultaneous training runs on their own clusters), which mitigates the main downside of having one single AGI project: power concentration.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2024-12-04T20:15:50.460Z · LW · GW

I still agree with a lot of that post and am still essentially operating under it.

It's also interesting to reread the comments: at the time, the promise from those who thought my post was wrong was that Anthropic's RSP would get better and that this was only the beginning. With RSP v2 being worse and less specific than RSP v1, it's clear that this was overoptimistic.

Risk management in AI has also become a lot more mainstream than it was a year ago, in large part thanks to the UK AISI, which started operationalizing it. People have also started using probabilities more, for instance in the safety cases paper, which this post advocated for.

With SaferAI, my organization, we're continuing to work on moving the field closer to traditional risk management and ensuring that we don't reinvent the wheel when there's no need to. There should be releases going in that direction over the coming months.

Overall, if I look back on my recommendations, I think they're still quite strong. "Make the name less misleading" hasn't been executed on, but names other than RSPs have started being used, such as Frontier AI Safety Commitments, which is a strong improvement on my "Voluntary safety commitments" suggestion.

My recommendations about what RSPs are and aren't are also solid. My worry that the current commitments in RSPs would be pushed into policy was basically right: they've been used in many policy conversations as an anchor for what to do and what not to do.

Finally, the push for risk management in policy that I wanted to see happen has mostly happened. This is great news. 

The main thing missing from this post is a prediction that RSPs would launch the debate about what should be done and at what levels. That is overall a good effect which has happened, and it would probably have happened several months later if not for the publication of RSPs. The fact that it was done in a voluntary-commitment context is unfortunate, because it levels everything down, but I still think this effect was significant.

Comment by simeon_c (WayZ) on Daniel Kokotajlo's Shortform · 2024-10-16T16:58:46.780Z · LW · GW

I'd also be interested in exploring model-spec-style aspirational documents.

Happy to do a call on model-spec-style aspirational documents if that's relevant. I think this is important, and we could be interested in helping develop a template for it if Anthropic were interested in using it.

Comment by simeon_c (WayZ) on Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming · 2024-10-12T21:42:40.608Z · LW · GW

Thanks for writing this post. I think the question of how to rule out risk once capability thresholds are crossed has generally been underdiscussed, despite being probably the hardest risk management question with Transformers. In a recent paper, we coin the term "assurance properties" for research directions that are helpful for this particular problem.

Applying a similar type of thinking to other existing safety techniques, it seems to me that interpretability is one of the only current LLM safety directions that can get you a big Bayes factor.

The second direction where I felt it could plausibly bring a big Bayes factor, although it was harder to think about because it's still very early, was debate.

Otherwise, it seemed to me that successes with things like RLHF / CAI / W2SG are unlikely to provide large Bayes factors.
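To make the "big Bayes factor" framing concrete, here is a minimal sketch (my own illustration with made-up numbers, not taken from the post or the paper) of how a likelihood ratio from a safety technique would move a prior on scheming:

```python
def update_with_bayes_factor(prior: float, bayes_factor: float) -> float:
    """Posterior P(scheming) after applying a Bayes factor.

    `bayes_factor` is the likelihood ratio P(evidence | scheming) /
    P(evidence | not scheming); values below 1 count against scheming.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# Illustrative numbers only: start from a 10% prior that the model is scheming.
prior = 0.10

# Evidence a scheming model could also produce fairly often (e.g. passing
# behavioral red-teaming) barely moves the needle.
print(update_with_bayes_factor(prior, bayes_factor=1 / 2))   # ~0.053

# Evidence with a 20:1 likelihood ratio against scheming -- the strength one
# would hope for from interpretability -- drives the posterior below 1%.
print(update_with_bayes_factor(prior, bayes_factor=1 / 20))  # ~0.0055
```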

Comment by simeon_c (WayZ) on Advice for journalists · 2024-10-10T02:24:01.515Z · LW · GW

This article fails to account for the fact that abiding by the suggested rules would mostly kill journalists' ability to share their most valuable information with the public.

You don't get to reveal things about the world's most powerful organizations if you double-check the quotes with them.

I think journalism is one of the professions where the trade-offs between consequentialist and deontological ethics are toughest. It's just really hard to abide by very high privacy standards and still break highly important news.

As one illustrative example, your standard would have prevented Kelsey Piper from sharing her conversation with SBF. Is that a desirable outcome? Not sure.

Comment by simeon_c (WayZ) on abstractapplic's Shortform · 2024-09-15T17:13:42.638Z · LW · GW

Personally I use a mix of heuristics based on how important the new idea is, how quickly it can be executed, and how painful it will be to execute in the future once the excitement dies down.

The more ADHD you are, the stronger the "burst of inspired-by-a-new-idea energy" effect is, so that should count too.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-09-14T08:12:01.375Z · LW · GW

Do people have takes on the most useful metrics/KPIs that could give a sense of how good the monitoring/anti-misuse measures on APIs are?

Some ideas: 
a) average time to close an account conducting misuse activities (a rough sketch of computing this is below; my sense is that as long as this is >1 day, there's little chance of preventing state actors from using API-based models for a lot of misuse, i.e. everything that doesn't require major scale)

b) the logs of the 5 accounts/interactions that have been ranked as highest severity (my sense is that incident reporting like what OpenAI/Microsoft have done on cyber is very helpful for getting a better mental model of what's happening / how bad things are going)

c) an estimate of the number of users having meaningful jailbroken interactions per month (in absolute terms, to give a sense of how much people are misusing the models through the API).

A lot of the open-source worry has implicitly assumed that it would be easier to misuse open-source models than closed-source ones, but it's unclear to what extent that's already the case, and I'm looking for metrics that give some insight into it. My sense is that misuse requiring more scale will likely rely more on open-source models, while misuse that is more in the infohazard realm (e.g. chem/bio) would be done best through APIs.
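To make metric (a) concrete, here is a minimal sketch (my own illustration; the log format and numbers are made up) of how a provider could compute it from abuse-team logs:

```python
from datetime import datetime
from statistics import mean

# Hypothetical abuse-team log: for each account flagged for misuse, when the
# activity was first detected and when the account was actually closed.
flagged_accounts = [
    {"detected": "2024-09-01T08:00:00", "closed": "2024-09-01T20:30:00"},
    {"detected": "2024-09-03T14:00:00", "closed": "2024-09-06T18:00:00"},
    {"detected": "2024-09-07T02:15:00", "closed": "2024-09-07T11:45:00"},
]

def hours_to_close(record: dict) -> float:
    detected = datetime.fromisoformat(record["detected"])
    closed = datetime.fromisoformat(record["closed"])
    return (closed - detected).total_seconds() / 3600

avg_hours = mean(hours_to_close(r) for r in flagged_accounts)
print(f"Average time to close a misusing account: {avg_hours:.1f} hours")

# Per the heuristic above, an average above 24 hours suggests state actors
# could run most non-scale-limited misuse through the API before being cut off.
print("Above the 1-day threshold:", avg_hours > 24)
```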

Comment by simeon_c (WayZ) on Would catching your AIs trying to escape convince AI developers to slow down or undeploy? · 2024-08-26T20:24:20.559Z · LW · GW

This looks overwhelmingly like the most likely outcome in my opinion, and I'm glad someone wrote this post. Thanks, Buck.

Comment by simeon_c (WayZ) on Neel Nanda's Shortform · 2024-07-13T15:12:48.235Z · LW · GW

Thanks for answering, that's very useful. 

My concern is that, as far as I understand, a decent number of safety researchers think policy is the most important area, but because, as you mentioned, they aren't policy experts and don't really know what's going on, they just assume Anthropic's policy work is far better than those actually working in policy judge it to be. I've heard from a surprisingly high number of people at the orgs doing the best AI policy work that Anthropic's policy work is mostly anti-helpful.

Somehow, though, internal employees keep deferring to their policy team and don't update on that / take those assessments seriously.

I'd generally bet Anthropic will push more for policies I personally support than any other lab, even if they may not push as much as I want them to.

If that's true, it's probably only true to an epsilon degree, and it might be wrong because of the weird preferences of a non-safety industry actor. AFAIK, Anthropic has been pushing against all the AI regulation proposals to date; I've yet to hear a positive example.

Comment by simeon_c (WayZ) on Neel Nanda's Shortform · 2024-07-12T19:29:50.507Z · LW · GW

How aware were you (as an employee), and are you (now), of their policy work? In a world model where policy is the most important thing, it seems to me like it could very negatively affect Anthropic's net impact.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-06-30T07:58:10.123Z · LW · GW

This is the best alignment plan I've heard in a while.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-06-29T08:29:49.433Z · LW · GW

You're a LessWrong reader, you want to advance humanity's wisdom, and you don't know how to do so? Here's a workflow:

  1. Pick an important topic where the entire world is confused 
  2. Post plausible sounding takes with a confident tone on it
  3. Wait for Gwern's comment on your post 
  4. Problem solved

See an application of the workflow here: https://www.lesswrong.com/posts/epgCXiv3Yy3qgcsys/you-can-t-predict-a-game-of-pinball?commentId=wjLFhiWWacByqyu6a

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-06-16T14:19:57.023Z · LW · GW

Playing catch-up is way easier than pushing the frontier of LLM research. One is about guessing which path others took; the other is about carving a path among all the possible ideas that could work.

If China stopped having access to US LLM secrets and had to push the LLM frontier rather than play catch-up, how much slower would it be at doing so?

My guess is >2x and probably more, but I'd be curious to get takes.

Comment by simeon_c (WayZ) on Non-Disparagement Canaries for OpenAI · 2024-05-30T21:04:05.933Z · LW · GW

Great initiative! Thanks for leading the charge on this.

Comment by simeon_c (WayZ) on AI companies aren't really using external evaluators · 2024-05-29T19:19:12.622Z · LW · GW

Jack Clark: “Pre-deployment testing is a nice idea but very difficult to implement,” from https://www.politico.eu/article/rishi-sunak-ai-testing-tech-ai-safety-institute/

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-05-26T07:13:09.111Z · LW · GW

Thanks for the answer, it makes sense.

To be clear, I saw it thanks to Matt, who posted this tweet, so credit goes to him: https://x.com/SpacedOutMatt/status/1794360084174410104?t=uBR_TnwIGpjd-y7LqeLTMw&s=19

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-05-25T21:39:26.023Z · LW · GW

Lighthaven City for 6.6M€? Worth a look by the Lightcone team.

https://x.com/zillowgonewild/status/1793726646425460738?t=zoFVs5LOYdSRdOXkKLGh4w&s=19

Comment by simeon_c (WayZ) on Daniel Kokotajlo's Shortform · 2024-05-24T20:03:21.746Z · LW · GW

Thanks for sharing. It's both disturbing from a moral perspective and fascinating to read.

Comment by simeon_c (WayZ) on AI companies aren't really using external evaluators · 2024-05-24T19:50:48.189Z · LW · GW

Very important point that wasn't on my radar. Thanks a lot for sharing.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-05-24T06:41:24.902Z · LW · GW

So first, the 85%-of-net-worth thing went quite viral several times and made Daniel Kokotajlo a bit of a heroic figure on Twitter.

Then Kelsey Piper's reporting pushed OpenAI to give back Daniel's vested units. I think it's likely that Kelsey used elements from this discussion as initial hints for her reporting, and plausible that the discussion sparked her reporting; I'd love to have her confirmation or denial on that.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-05-23T20:46:48.588Z · LW · GW

I'm not gonna lie, I'm pretty crazily happy that a random quick take I wrote in 10 minutes on a Friday morning, about how Daniel Kokotajlo should get social reward and be partially reimbursed, sparked a discussion that seems to have caused positive effects way beyond expectations.

Quick takes are an awesome innovation; they allow posting even when one is still partially confused/uncertain about something. Given the confusing details of the situation in this case, this would probably not have happened otherwise.

Comment by simeon_c (WayZ) on Stephen Fowler's Shortform · 2024-05-21T07:21:38.783Z · LW · GW

Mhhh, that seems very bad for someone in an AISI in general. I'd guess Jade Leung might sadly be under the same obligations... 

That seems like a huge deal to me with disastrous consequences, thanks a lot for flagging.

Comment by simeon_c (WayZ) on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-14T05:55:14.014Z · LW · GW

Right. Thanks for providing the full context. "Voluntary commitments" refers to the WH commitments, which are much narrower than the PF, so I think my observation holds.

Comment by simeon_c (WayZ) on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-14T05:47:13.045Z · LW · GW

Agreed. Note that they don't say what Martin claims they say; they only say:

We’ve evaluated GPT-4o according to our Preparedness Framework

I think it's reasonably likely that they broke all their non-evaluation PF commitments while this statement remains technically not wrong.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-05-10T05:56:34.294Z · LW · GW

Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders in the AI safety landscape give some money or social reward for this?

I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-12T17:21:47.777Z · LW · GW

I mean, the full option space obviously also includes "bargain with Russia and China to make credible commitments that they stop rearming (possibly in exchange for something)", and I think we should totally explore that path as well; I just don't have much hope in it at this stage, which is why I'm focusing on the other option, even if it is a fucked-up local Nash equilibrium.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-12T13:31:31.112Z · LW · GW

I've been thinking a lot recently about taxonomizing AI risk-related concepts to reduce the dimensionality of AI threat modelling while remaining quite comprehensive. It's in the context of developing categories to assess whether labs' plans cover various areas of risk.

There are two questions I'd like to get takes on. Any take on either of these two would be very valuable.

  1. In the misalignment threat model space, a number of safety teams tend to assume that the only type of goal misgeneralization that could lead to X-risks is deceptive misalignment. I'm not sure I understand where that confidence comes from. Could anyone make or link to a case that rules out the plausibility of all other forms of goal misgeneralization? 
  2. It seems to me that to minimize the dimensionality of the threat modelling, it's sometimes more useful to think about the threat model (e.g. a terrorist misuses an LLM to develop a bioweapon) and sometimes more useful to think about a property which has many downstream consequences on the level of risk. I'd like to get takes on one such property:
    1. Situational awareness: It seems to me that it's most useful to think of this property as its own hazard which has many downstream consequences on the level of risk (most prominently that a model with it can condition on being tested when completing tests). Do you agree or disagree with this take? Or would you rather discuss situational awareness only in the context of the deceptive alignment threat model?
Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-11T20:55:15.238Z · LW · GW

Rephrasing based on an ask: "Western Democracies need to urgently put a hard stop to Russia and China's war (preparation) efforts" -> Western democracies need to urgently take action to stop the current shift towards a new world order where conflicts are a lot more likely, because Western democracies are no longer a hegemonic power able to crush authoritarian powers that grab land, etc. This shift is currently driven primarily by the fact that Russia & China are heavily rearming themselves whereas Western democracies are not.

@Elizabeth

Comment by simeon_c (WayZ) on How did you integrate voice-to-text AI into your workflow? · 2024-04-10T22:48:35.322Z · LW · GW

I liked this extension (https://chrome.google.com/webstore/detail/whispering/oilbfihknpdbpfkcncojikmooipnlglo), which I use for long messages. I press a shortcut, it starts recording with Whisper, then I press it again and it puts the transcript in my clipboard.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-10T20:36:36.074Z · LW · GW

In those, Ukraine committed to pass laws for Decentralisation of power, including through the adoption of the Ukrainian law "On temporary Order of Local Self-Governance in Particular Districts of Donetsk and Luhansk Oblasts". Instead of Decentralization they passed laws forbidding those districts from teaching children in the languages that those districts wants to teach them. 

Ukraines unwillingness to follow the agreements was a key reason why the invasion in 2022 happened and was very popular with the Russian population

I didn't know that; that's useful, thank you. 

My (simple) reasoning is that I pattern-matched hard to the Anschluss (https://en.wikipedia.org/wiki/Anschluss) as a prelude to WW2, where democracies accepted a first conquest hoping that it would stop there (spoiler: it didn't). 

Minsk feels very much the same way. From the perspective of democracies, it seems kind of reasonable to try a peaceful resolution once, accepting a conquest and seeing if Putin stops (although in hindsight it was unreasonable not to prepare for the possibility that he doesn't). Now that he has started invading Ukraine as a whole, it seems really hard for me to believe "once he gets Ukraine, he'll really stop". I expect many reasons to invade other adjacent countries to come up as well.

The latest illegal land grab was done by Israel without any opposition by the US. If you are truly worried about land grabs being a problem why not speak against that US position of being okay with some land grabs instead of just speaking for buying more weapons?

Two things on this. 

  1. Object-level: I'm not ok with this. 
  2. At a meta-level, there's a repugnant moral dilemma fundamental to this:
    1. American hegemonic power has been abused, e.g. see https://en.wikipedia.org/wiki/July_12,_2007,_Baghdad_airstrike or a number of wars that the US started for dubious reasons (i.e. usually some economic or geostrategic interests). (Same for France; I'm just focusing on the US here for simplicity.)
    2. Still, despite those deep injustices, the 2000s have been the least lethal period for interstate conflicts, because hegemony, with the threat of being crushed by the great power, heavily disincentivizes anyone from fighting. 
      1. It seems to me that hegemony of some power or coalition of powers is the most stable state for that reason. So I find this state quite desirable.
    3. Then the other question is, who should be in that position?
      1. I have the chance to be able to write this about my country without ending up in jail for it. And if I do end up in jail, I have higher odds than in most other countries of being able to contest it. 
      2. So, although Western democracies are quite bad and repugnant in a bunch of ways, I find them the least bad and most beneficial existing form of political power to defend and preserve the hegemony of.
Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-10T20:13:13.637Z · LW · GW

Indeed. One consideration is that the LW community used to be much less into policy-adjacent stuff and hence much less relevant in that domain. Now, with AI governance becoming an increasingly big deal, I think we could potentially use some of that presence to push for certain things in defense. 

Pushing for things along the lines of what Noah describes in the first piece I shared seems feasible for some people in policy.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-10T15:43:59.323Z · LW · GW

Idk what the LW community can do, but somehow, to the extent we think liberalism is valuable, Western democracies need to urgently put a hard stop to Russia and China's war (preparation) efforts. I fear that rearmament is a key component of the only viable path at this stage.

I won't argue in detail here but will link to Noahpinion, who's been quite vocal on those topics. The TLDR is that China and Russia have been scaling up their war industry preparation efforts for years, while Western democracies' industries keep declining and remain crazily dependent on Chinese industry. This creates a new global equilibrium where the US is no longer powerful enough to disincentivize all authoritarian regimes from grabbing more land, etc.

Some readings relevant to that:

I know this is not a core LW theme, but to the extent this threat might be existential to liberalism, and to the existence of LW as a website in the first place, I think we should all care. It would also be quite terrible for safety if AGI were developed during a global war, which seems uncomfortably likely (~10% imo).

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-08T21:33:41.398Z · LW · GW

If you wanna reread the debate, you can scroll through this thread (https://x.com/bshlgrs/status/1764701597727416448). 

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-08T18:21:29.546Z · LW · GW

There was a hot debate recently, but regardless, the bottom line is just: "RSPs should probably be interpreted literally and nothing else. If a literal statement is not strictly there, it should be assumed it's not a commitment."

I've not seen people doing very literal interpretations of those, so I just wanted to emphasize that point.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-08T12:27:54.773Z · LW · GW

Given the recent argument over whether Anthropic really did commit to not push the frontier or just misled most people into thinking that was the case, it's relevant to reread the RSPs in hairsplitting mode. I was rereading them and noticed a few relevant things:

Disclaimer: this is focused on negative stuff but does not deny the merits of RSPs etc etc.

  1. I couldn't find any sentence committing to not significantly increase extreme risks. OTOH, I found statements that, if taken literally, could imply an implicit acknowledgment of the opposite: "our most significant immediate commitments include a high standard of security for ASL-3 containment, and a commitment not to deploy ASL-3 models until thorough red-teaming finds no risk of catastrophe.". 
    Note that it makes a statement on risk bearing only on deployment measures and not on security. Given that the lack of security is probably the biggest source of risk from ASL-3 systems & the biggest weakness of RSPs, I find it pretty likely that this is not random.
  2. I found a number of commitments that are totally unenforceable in hairsplitting mode. Here are two examples: 
    1. "World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse." 
      1. The use of five underdefined adjectives + "significantly" is a pretty safe barrier against any enforcement.
    2. "When informed of a newly discovered model vulnerability enabling catastrophic harm (e.g. a jailbreak or a detection failure), we commit to mitigate or patch it promptly (e.g. 50% of the time in which catastrophic harm could realistically occur)."
      1. The combination of "or", the characterization of "promptly" as "50% of the time", and the use of "e.g." and "realistically" is also a safe barrier against enforceability. 
  3. It's only my subjective judgment here, and you don't have to trust it, but I also found Core Views on AI Safety to have a number of similar patterns.
Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-04T09:01:49.177Z · LW · GW

There are a number of properties of AI systems that make it easier to collect information about those systems in a safe way and hence demonstrate their safety: interpretability, formal verifiability, modularity, etc. Which adjective would you use to characterize those properties?

I'm thinking of "resilience" because, from the perspective of an AI developer, it helps a lot with understanding the risk profile, but do you have other suggestions? 

Some alternatives: 

  1. auditability properties
  2. legibility properties
Comment by simeon_c (WayZ) on Vote on Anthropic Topics to Discuss · 2024-03-06T23:57:27.176Z · LW · GW

Unsure how much we disagree, Zach and Oliver, so I'll try to quantify: I would guess that Claude 3 will cut the release date of OpenAI's next-gen models by at least a few months (I would guess 3 months), which has significant effects on timelines.

Tentatively, I'm thinking that this effect may be superlinear. My model is that each new release increases the speed of development (because of increased investment across the whole value chain, including compute, plus the realization from people that it's not like other technologies, etc.), so a few months now causes more than a few months of change in AGI timelines.
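As a rough illustration of why this would be superlinear, here's a toy calculation (my own sketch with made-up numbers, not part of the original comment): if a competitive release both pulls the next release forward directly and speeds up all subsequent development, the total timeline compression exceeds the direct shift.

```python
# Toy model (illustrative assumptions only): a release shifts timelines
# directly by `direct_shift_months` and multiplies the subsequent pace of
# development by (1 + speedup).
def total_timeline_compression(direct_shift_months: float,
                               speedup: float,
                               remaining_path_months: float) -> float:
    # Work that would have taken `remaining_path_months` at the old pace
    # now takes remaining_path_months / (1 + speedup) at the faster pace.
    pace_gain = remaining_path_months - remaining_path_months / (1 + speedup)
    return direct_shift_months + pace_gain

# A 3-month direct shift plus a 10% faster pace over a 5-year remaining path
# compresses timelines by roughly 8.5 months, not 3.
print(total_timeline_compression(3, 0.10, 60))  # ~8.45
```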

Comment by WayZ on [deleted post] 2024-02-11T17:28:56.631Z

Oh thanks, I hadn't found it, gonna delete!

Comment by simeon_c (WayZ) on Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis · 2024-02-02T06:27:47.746Z · LW · GW

Yeah, basically Davidad has not only a safety plan but also a governance plan that actively aims at making this shift happen!

Comment by simeon_c (WayZ) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-26T17:46:39.653Z · LW · GW

Thanks for writing that. I've been trying to taboo "goals" because it creates so much confusion, which this post tries to decrease. In line with this post, I think what matters is how difficult a task is to achieve, and what it takes to achieve it in terms of ability to overcome obstacles.

Comment by simeon_c (WayZ) on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-03T13:44:53.371Z · LW · GW
Comment by simeon_c (WayZ) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-28T13:30:45.554Z · LW · GW

"Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL."

And/or = or, so I just want to flag that the actual commitment here could be as weak as "we delay the deployment but keep scaling internally". If it's a mistake, you can correct it, but if it's not, it doesn't seem like a robust commitment to pause to me, even assuming that the conditions for a pause were well established.

Comment by simeon_c (WayZ) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-28T12:13:18.480Z · LW · GW

Because it's meaningless to talk about a "compromise" that dismisses one entire side of the people who disagree with you (but only one side!).

Like, I could say "global compute thresholds are a robustly good compromise with everyone* who disagrees with me".

*Footnote: only those who're more pessimistic than me.

Comment by simeon_c (WayZ) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-27T22:49:03.362Z · LW · GW

That may be right, but then the claim is wrong. The true claim would be "RSPs seem like a robustly good compromise with people who are more optimistic than me".

And then the claim becomes not really relevant?

Comment by simeon_c (WayZ) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-27T20:56:39.397Z · LW · GW

Holden, thanks for this public post. 

  1. I would love it if you could write something along the lines of what you wrote in "If it were all up to me, the world would pause now - but it isn’t, and I’m more uncertain about whether a “partial pause” is good" at the top of ARC's post, which, as we discussed and as I wrote in my post, would make RSPs more likely to be positive in my opinion by making the policy / voluntary safety commitments distinction clearer.

Regarding 

Responsible scaling policies (RSPs) seem like a robustly good compromise with people who have different views from mine

2. It seems like this is empirically wrong given the strong pushback RSPs have received, so at the least you shouldn't call it "robustly" good, unless you mean a modified version that would accommodate the most important parts of the pushback. 

3. I feel like, overall, the way you discuss RSPs here is one of the many instances of people talking about idealized RSPs that are not specified and pointing to them in response to disagreement. See below, from my post:

And second, the coexistence of ARC's RSP framework with labs' specific RSP implementations allows slack for commitments that are weak, within a framework that would in theory allow ambitious commitments. It leads to many arguments of the form:

  • “That’s the V1. We’ll raise ambition over time”. I’d like to see evidence of that happening over a 5-year timeframe, in any field or industry. I can think of fields, like aviation, where it happened over the course of decades, crash after crash. But if it’s relying on expectations that there will be large-scale accidents, then it should be clear. If it’s relying on the assumption that timelines are long, it should be explicit. 
  • “It’s voluntary, we can’t expect too much and it’s way better than what’s existing”. Sure, but if the level of catastrophic risk is 1% (which several AI risk experts I’ve talked to believe to be the case for ASL-3 systems) and it gives the impression that risks are covered, then the name “responsible scaling” is heavily misleading to policymakers. The adequate name for 1% catastrophic risk would be catastrophic scaling, which is less rosy.

Thanks for the post.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T18:45:16.459Z · LW · GW

Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."

Great question! A few points: 

  1. Yes, many of the things I point to are about "how to do things well", and I would in fact much prefer something that contains a section saying "we are striving towards that and our current effort is insufficient" to the current RSP communication, which is more "here's how to responsibly scale". 
  2. That said, I think we disagree on the reference class of the effort (you say "a few years"). I think that you could do a very solid MVP of what I suggest with like 5 FTEs over 6 months. 
  3. As I wrote in "How to move forward" (worth skimming to understand what I'd change), I think that RSPs would be incredibly better if they: 
    1. had a different name
    2. said that they are insufficient
    3. linked to a post which says "here's the actual thing which is needed to make us safe". 
  4. Answer to your question: if I were optimizing within the paradigm of voluntary lab commitments, as ARC is, then yes, I would much prefer that. I flagged early on, though, that because labs are definitely not allies on this (an actual risk assessment is likely to output "stop"), I think the "ask labs kindly" strategy is pretty doomed, and I would much prefer a version of ARC trying to acquire bargaining power one way or another (policy, PR threat, etc.) rather than adapting their framework until labs agree to sign it. 

Regarding 

If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".

I don't think that's necessarily right; e.g. "the ISO standard asks the organization to define risk thresholds" could be a very simple task, much simpler than developing a full eval. The tricky thing is just to ensure we comply with such thresholds (and the inability to do that obviously reveals a lack of safety). 

Regarding "ISO proposes a much more comprehensive procedure than RSPs": it's not right either that it would take longer; there exist risk management tools, which you can run in a few days, that help give very broad coverage of the scenario set.

Regarding "imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?": once again, you can cover the most obvious things in a couple of pages, e.g. by writing "Maybe they would give the weights to their team of hackers, which substantially increases the chances of a leak and of an increase in global cyberoffence." And I would be totally fine with half-baked things if they were communicated as such, and not the way RSPs are.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T18:25:15.764Z · LW · GW

Two questions related to it: 

  1. What happens in your plan if it takes five years to solve the safety evaluation/deception problem for LLMs (i.e. it's extremely hard)?
  2. Do you have an estimate of P({China; Russia; Iran; North Korea} steals an ASL-3 system with ASL-3 security measures)? Conditional on one of these countries having the system, what's your guess of p(catastrophe)?
Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T14:17:35.363Z · LW · GW

Thanks Eli for the comment. 

One reason I haven't provided much evidence is that I think it's substantially harder to give evidence for a "for all" claim (my side of the claim) than for a "there exists" claim (what I'm asking Evan for). Based on what I've seen, I claim that a framework in a niche area doesn't evolve that fast without accidents, even in domains with substantial updates, like aviation and nuclear.

I could potentially see it happening with large accidents, but I personally don't want to bet on that, and I would want it to be transparent if that's the assumption. I also don't buy the "small coordinations enable larger coordinations" argument for domain-specific policy. Beyond what you said above, my sense is that policymakers satisfice and hence tend not to revisit a policy that sucks if it looks sufficiently good to stakeholders that there's no substantial incentive to change it.

GDPR cookie banners suck for everyone and haven't been updated yet, 7 years after GDPR. Standards in the EU aren't updated more often than every 5 years by default (I'm talking about standards, not regulation), and we'll have to bargain to try to bring that down to reasonable, AI-specific timeframes. 

The IAEA & nuclear safety upgraded substantially after each accident, and likewise for aviation, but we're talking about decades, not 5 years.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T11:59:00.322Z · LW · GW

Thanks for your comment. 

I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done.

I strongly disagree with this. In my opinion, a lot of the issue is that RSPs were designed from first principles without much consideration for everything the risk management field has already done, and hence get things wrong without noticing. 

It's not a matter of how detailed they are; they get the broad principles wrong. As I argued (the entire table is about this), I think that the principles of other existing standards are just way better, so no, it's not a matter of details. 

As I said, the details & evals of RSPs are actually the one thing that I'd keep and include in a risk management framework. 

Honestly I can't think of anything much better that could have been reasonably done given the limited time and resources we all have

Well, I recommend looking at Section 3 and the source links. Starting from those frameworks and including evals in them would be a Pareto improvement.