Posts

simeon_c's Shortform 2024-04-04T09:01:48.921Z
Forecasting future gains due to post-training enhancements 2024-03-08T02:11:57.228Z
Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis 2024-02-01T21:30:44.090Z
A Brief Assessment of OpenAI's Preparedness Framework & Some Suggestions for Improvement 2024-01-22T20:08:57.250Z
Responsible Scaling Policies Are Risk Management Done Wrong 2023-10-25T23:46:34.247Z
Do LLMs Implement NLP Algorithms for Better Next Token Predictions? 2023-09-19T12:28:45.660Z
In the Short-Term, Why Couldn't You Just RLHF-out Instrumental Convergence? 2023-09-16T10:44:46.459Z
AGI x Animal Welfare: A High-EV Outreach Opportunity? 2023-06-28T20:44:25.836Z
The Cruel Trade-Off Between AI Misuse and AI X-risk Concerns 2023-04-22T13:49:02.124Z
AI Takeover Scenario with Scaled LLMs 2023-04-16T23:28:14.004Z
Navigating AI Risks (NAIR) #1: Slowing Down AI 2023-04-14T14:35:40.395Z
Request to AGI organizations: Share your views on pausing AI progress 2023-04-11T17:30:46.707Z
Could Simulating an AGI Taking Over the World Actually Lead to a LLM Taking Over the World? 2023-01-13T06:33:35.860Z
[Linkpost] DreamerV3: A General RL Architecture 2023-01-12T03:55:29.931Z
Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers? 2022-12-31T11:34:18.185Z
AGI Timelines in Governance: Different Strategies for Different Timeframes 2022-12-19T21:31:25.746Z
Extracting and Evaluating Causal Direction in LLMs' Activations 2022-12-14T14:33:05.607Z
Is GPT3 a Good Rationalist? - InstructGPT3 [2/2] 2022-04-07T13:46:58.255Z
New GPT3 Impressive Capabilities - InstructGPT3 [1/2] 2022-03-13T10:58:46.326Z

Comments

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-12T17:21:47.777Z · LW · GW

I mean the full option space obviously also includes "bargain with Russia and China to make credible commitments that they stop rearming (possibly in exchange for something)", and I think we should totally explore that path as well. I just don't have much hope in it at this stage, which is why I'm focusing on the other option, even if it is a fucked up local Nash equilibrium.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-12T13:31:31.112Z · LW · GW

I've been thinking a lot recently about taxonomizing AI-risk-related concepts to reduce the dimensionality of AI threat modelling while remaining quite comprehensive. It's in the context of developing categories to assess whether labs' plans cover various areas of risk.

There are two questions I'd like to get takes on. Any take on either of them would be very valuable.

  1. In the misalignment threat model space, a number of safety teams tend to assume that the only type of goal misgeneralization that could lead to X-risks is deceptive misalignment. I'm not sure I understand where that confidence comes from. Could anyone make or link to a case that rules out the plausibility of all other forms of goal misgeneralization? 
  2. It seems to me that to minimize the dimensionality of the threat modelling, it's sometimes more useful to think about the threat model (e.g. a terrorist misuses an LLM to develop a bioweapon) and sometimes more useful to think about a property which has many downstream consequences on the level of risk. I'd like to get takes on one such property:
    1. Situational awareness: It seems to me that it's most useful to think of this property as its own hazard which has many downstream consequences on the level of risk (most prominently that a model with it can condition on being tested when completing tests). Do you agree or disagree with this take? Or would you rather discuss situational awareness only in the context of the deceptive alignment threat model?
Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-11T20:55:15.238Z · LW · GW

Rephrasing based on an ask: "Western Democracies need to urgently put a hard stop to Russia and China's war (preparation) efforts" -> Western democracies need to urgently take action to stop the current shift towards a new world order where conflicts are a lot more likely, due to Western democracies no longer being a hegemonic power able to crush authoritarian powers that grab land etc. This shift is currently primarily driven by the fact that Russia & China are heavily rearming themselves whereas Western democracies are not.

@Elizabeth

Comment by simeon_c (WayZ) on How did you integrate voice-to-text AI into your workflow? · 2024-04-10T22:48:35.322Z · LW · GW

I liked this extension (https://chrome.google.com/webstore/detail/whispering/oilbfihknpdbpfkcncojikmooipnlglo), which I use for long messages. I press a shortcut, it starts recording with Whisper, then I press it again and it puts the transcript in my clipboard.
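For anyone who'd rather script the same loop locally, here's a minimal sketch of the record → transcribe → clipboard flow. The package choices (sounddevice, openai-whisper, pyperclip) and the fixed-length recording are my own assumptions for illustration, not what the extension actually does:

```python
# Rough DIY sketch of the record -> Whisper -> clipboard loop.
# Assumes: pip install sounddevice scipy openai-whisper pyperclip (plus ffmpeg installed).
import sounddevice as sd
from scipy.io import wavfile
import whisper
import pyperclip

SAMPLE_RATE = 16000
SECONDS = 30  # fixed-length recording for simplicity; the extension toggles on/off instead

print("Recording...")
audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()  # block until the recording finishes

wavfile.write("note.wav", SAMPLE_RATE, audio)

model = whisper.load_model("base")      # small local Whisper model
result = model.transcribe("note.wav")

pyperclip.copy(result["text"])          # transcript lands in the clipboard
print("Transcript copied to clipboard.")
```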

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-10T20:36:36.074Z · LW · GW

In those, Ukraine committed to pass laws for Decentralisation of power, including through the adoption of the Ukrainian law "On temporary Order of Local Self-Governance in Particular Districts of Donetsk and Luhansk Oblasts". Instead of Decentralization they passed laws forbidding those districts from teaching children in the languages that those districts wants to teach them. 

Ukraines unwillingness to follow the agreements was a key reason why the invasion in 2022 happened and was very popular with the Russian population

I wasn't aware of that, that's useful, thank you. 

My (simple) reasoning is that I pattern-matched hard to the Anschluss (https://en.wikipedia.org/wiki/Anschluss) as a prelude to WW2, where democracies accepted a first conquest hoping that it would stop there (spoiler: it didn't). 

Minsk feels very much the same way. From the perspective of democracies, it seems kinda reasonable to try a peaceful resolution once, accepting a conquest and seeing if Putin stops (although in hindsight it was unreasonable not to prepare for the possibility that he doesn't). Now that he has started invading Ukraine as a whole, it seems really hard for me to believe "once he gets Ukraine, he'll really stop". I expect many reasons to invade other adjacent countries to come up as well.

The latest illegal land grab was done by Israel without any opposition by the US. If you are truly worried about land grabs being a problem why not speak against that US position of being okay with some land grabs instead of just speaking for buying more weapons?

Two things on this. 

  1. Object-level: I'm not ok with this. 
  2. At a meta-level, there's a repugnant moral dilemma fundamental to this:
    1. American hegemonic power has been abused, e.g. see https://en.wikipedia.org/wiki/July_12,_2007,_Baghdad_airstrike, or a number of wars that the US started for dubious reasons (i.e. usually some economic or geostrategic interest). (The same goes for France; I'm just focusing on the US here for simplicity.)
    2. Still, despite those deep injustices, the 2000s have been the least lethal period for interstate conflict, because hegemony, with the threat of being crushed by the great power, heavily disincentivizes anyone from fighting. 
      1. It seems to me that hegemony of some power or coalition of powers is the most stable state for that reason. So I find this state quite desirable.
    3. Then the other question is, who should be in that position?
      1. I'm lucky to be able to write this about my country without ending up in jail for it. And if I did end up in jail, I'd have better odds than in most other countries of being able to contest it. 
      2. So, although Western democracies are quite bad and repugnant in a bunch of ways, I find them the least bad and most beneficial existing form of political power, and the one whose hegemony is currently worth defending and preserving.
Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-10T20:13:13.637Z · LW · GW

Indeed. One consideration is that the LW community used to be much less into policy-adjacent stuff and hence much less relevant in that domain. Now, with AI governance becoming an increasingly big deal, I think we could potentially use some of that presence to push for certain things in defense. 

Pushing for things along the lines of what Noah describes in the first piece I shared seems feasible for some people in policy.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-10T15:43:59.323Z · LW · GW

Idk what the LW community can do, but somehow, to the extent we think liberalism is valuable, Western democracies need to urgently put a hard stop to Russia and China's war (preparation) efforts. I fear that rearmament is a key component of the only viable path at this stage.

I won't argue in detail here but will link to Noahpinion, who's been quite vocal on those topics. The TLDR is that China and Russia have been scaling up their war industry preparation efforts for years, while Western democracies' industries keep declining and remain crazily dependent on Chinese industry. This creates a new global equilibrium where the US is no longer powerful enough to disincentivize all authoritarian regimes from grabbing more land etc.

Some readings relevant to that:

I know this is not a core LW theme but to the extent this threat might be existential to liberalism, and to the existence of LW as a website in the first place, I think we should all care. It would also be quite terrible for safety if AGI was developed during a global war, which seems uncomfortably likely (~10% imo).

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-08T21:33:41.398Z · LW · GW

If you wanna reread the debate, you can scroll through this thread (https://x.com/bshlgrs/status/1764701597727416448). 

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-08T18:21:29.546Z · LW · GW

There was a hot debate recently but regardless, the bottom line is just "RSPs should probably be interpreted literally and nothing else. If a literal statement is not strictly there, it should be assumed it's not a commitment."

I've not seen people interpreting them very literally, so I just wanted to emphasize that point.

Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-08T12:27:54.773Z · LW · GW

Given the recent argument about whether Anthropic really committed to not pushing the frontier or just misled most people into thinking that was the case, it's relevant to reread the RSPs in hairsplitting mode. I was rereading them and noticed a few relevant things:

Disclaimer: this is focused on negative stuff but does not deny the merits of RSPs etc etc.

  1. I couldn't find any sentence committing to not significantly increasing extreme risks. OTOH, I found statements that, if taken literally, could imply an implicit acknowledgment of the opposite: "our most significant immediate commitments include a high standard of security for ASL-3 containment, and a commitment not to deploy ASL-3 models until thorough red-teaming finds no risk of catastrophe." 
    Note that the statement about risk bears only on deployment measures, not on security. Given that lack of security is probably the biggest source of risk from ASL-3 systems & the biggest weakness of RSPs, I find it pretty likely that this is not random.
  2. I found a number of commitments that are totally unenforceable in hairsplitting mode. Here are two examples: 
    1. "World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse." 
      1. The use of five underdefined adjectives + "significantly" is a pretty safe barrier against any enforcement.
    2. "When informed of a newly discovered model vulnerability enabling catastrophic harm (e.g. a jailbreak or a detection failure), we commit to mitigate or patch it promptly (e.g. 50% of the time in which catastrophic harm could realistically occur)."
      1. The combination of "or", the characterization of "promptly" as "50% of the time", the use of "e.g.", and "realistically" is also a safe barrier against enforceability. 
  3. It's only my subjective judgment here, and you don't have to trust it, but I also found Core Views on AI Safety to have a number of similar patterns.
Comment by simeon_c (WayZ) on simeon_c's Shortform · 2024-04-04T09:01:49.177Z · LW · GW

There are a number of properties of AI systems that make it easier to collect information about those systems in a safe way and hence demonstrate their safety: interpretability, formal verifiability, modularity, etc. Which adjective would you use to characterize those properties?

 

I'm thinking of "resilience", because from the perspective of an AI developer these properties help a lot with understanding the risk profile, but do you have other suggestions? 

Some alternatives: 

  1. auditability properties
  2. legibility properties
Comment by simeon_c (WayZ) on Vote on Anthropic Topics to Discuss · 2024-03-06T23:57:27.176Z · LW · GW

Unsure how much we disagree, Zach and Oliver, so I'll try to quantify: I would guess that Claude 3 will move up the release date of OpenAI's next-gen models by at least a few months (my guess is 3 months), which has significant effects on timelines.

Tentatively, I'm thinking that this effect may be superlinear. My model is that each new release increases the speed of development (because of increased investment across the whole value chain, including compute, plus the realization from people that it's not like other technologies, etc.), so a few months of acceleration now shortens AGI timelines by more than a few months.

Comment by WayZ on [deleted post] 2024-02-11T17:28:56.631Z

Oh thanks, I hadn't found it, gonna delete!

Comment by simeon_c (WayZ) on Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis · 2024-02-02T06:27:47.746Z · LW · GW

Yeah basically Davidad has not only a safety plan but a governance plan which actively aims at making this shift happen!

Comment by simeon_c (WayZ) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-26T17:46:39.653Z · LW · GW

Thanks for writing that. I've been trying to taboo "goals" because it creates so much confusion, which this post tries to decrease. In line with this post, I think what matters is how difficult a task is to achieve, and what it takes to achieve it in terms of ability to overcome obstacles.

Comment by simeon_c (WayZ) on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-03T13:44:53.371Z · LW · GW
Comment by simeon_c (WayZ) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-28T13:30:45.554Z · LW · GW

"Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL."

And/or = or, so I just want to flag that the actual commitment here could be as weak as "we delay the deployment but keep scaling internally". If it's a mistake, you can correct it, but if it's not, it doesn't seem like a robust commitment to pause to me, even assuming that the conditions for a pause were well established.

Comment by simeon_c (WayZ) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-28T12:13:18.480Z · LW · GW

Because it's meaningless to talk about a "compromise" while dismissing one entire side of the people who disagree with you (but only one side!).

Like, I could say "global compute thresholds are a robustly good compromise with everyone* who disagrees with me".

*Footnote: only those who're more pessimistic than me.

Comment by simeon_c (WayZ) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-27T22:49:03.362Z · LW · GW

That may be right but then the claim is wrong. The true claim would be "RSPs seem like a robustly good compromise with people who are more optimistic than me".

And then the claim becomes not really relevant?

Comment by simeon_c (WayZ) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-27T20:56:39.397Z · LW · GW

Holden, thanks for this public post. 

  1. I would love it if you could write something along the lines of what you wrote in "If it were all up to me, the world would pause now - but it isn’t, and I’m more uncertain about whether a “partial pause” is good" at the top of ARC's post, which, as we discussed and as I wrote in my post, would in my opinion make RSPs more likely to be positive by making the policy/voluntary safety commitments distinction clearer.

Regarding 

Responsible scaling policies (RSPs) seem like a robustly good compromise with people who have different views from mine

2. It seems empirically wrong, based on the strong pushback RSPs have received, so at the very least you shouldn't call it "robustly" good, unless you mean a kind of modified version that would accommodate the most important parts of the pushback. 

3. I feel like, overall, the way you discuss RSPs here is one of the many instances of people chatting about idealized RSPs that are not specified, and pointing to them in response to disagreement. See below, from my post:

And second, the coexistence of ARC's RSP framework with labs' specific RSP implementations allows slack for weak commitments within a framework that would in theory allow ambitious commitments. It leads to many arguments of the form:

  • “That’s the V1. We’ll raise ambition over time”. I’d like to see evidence of that happening over a 5-year timeframe, in any field or industry. I can think of fields, like aviation, where it happened over the course of decades, crash after crash. But if it’s relying on the expectation that there will be large-scale accidents, then that should be made clear. If it’s relying on the assumption that timelines are long, that should be explicit. 
  • “It’s voluntary, we can’t expect too much and it’s way better than what exists”. Sure, but if the level of catastrophic risk is 1% (which several AI risk experts I’ve talked to believe to be the case for ASL-3 systems) and it gives the impression that risks are covered, then the name “responsible scaling” heavily misleads policymakers. The adequate name for 1% catastrophic risk would be “catastrophic scaling”, which is less rosy.

Thanks for the post.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T18:45:16.459Z · LW · GW

Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."

Great question! A few points: 

  1. Yes, many of the things I point to are "how to do things well", and I would in fact much prefer something that contains a section saying "we are striving towards that and our current effort is insufficient" over the current RSP communication, which is more "here's how to responsibly scale". 
  2. That said, I think we disagree on the reference class of the effort (you say "a few years"). I think that you could do a very solid MVP of what I suggest with like 5 FTEs over 6 months. 
  3. As I wrote in "How to move forward" (worth skimming to understand what I'd change) I think that RSPs would be incredibly better if they: 
    1. had a different name
    2. said that they are insufficient
    3. linked to a post which says "here's the actual thing which is needed to make us safe". 
  4. Answer to your question: if I were optimizing within the paradigm of voluntary lab commitments, as ARC is, then yes, I would much prefer that. I flagged early on, though, that because labs are definitely not allies on this (because an actual risk assessment is likely to output "stop"), I think the "ask labs kindly" strategy is pretty doomed, and I would much prefer a version of ARC trying to acquire bargaining power one way or another (policy, PR threat, etc.) rather than adapting their framework until labs accept to sign it. 

Regarding 

If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".

I don't think that's necessarily right; e.g. "the ISO standard asks the organization to define risk thresholds" could be a very simple task, much simpler than developing a full eval. The tricky thing is just to ensure we comply with such levels (and the inability to do that obviously reveals a lack of safety). 

"ISO proposes a much more comprehensive procedure than RSPs", it's not right either that it would take longer, it's just that there exists risk management tools, that you can run in like a few days, that helps having a very broad coverage of the scenario set.

"imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?" once again you can cover the most obvious things in like a couple pages. Writing "Maybe they would give the weights to their team of hackers, which increases substantially the chances of leak and global cyberoffence increase". And I would be totally fine with half-baked things if they were communicated as such and not as RSPs are.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T18:25:15.764Z · LW · GW

Two questions related to it: 

  1. What happens in your plan if it takes five years to solve the safety evaluation/deception problem for LLMs (i.e. it's extremely hard)?
  2. Do you have an estimate of P({China; Russia; Iran; North Korea} steals an ASL-3 system with ASL-3 security measures)? Conditional on one of these countries having the system, what's your guess of p(catastrophe)?
Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T14:17:35.363Z · LW · GW

Thanks Eli for the comment. 

One reason why I haven't provided much evidence is that I think it's substantially harder to give evidence for a "for all" claim (my side of the claim) than for a "there exists" claim (what I'm asking Evan for). Based on what I've seen, I claim that a framework in a niche area doesn't evolve that fast without accidents, even in domains with substantial updates, like aviation and nuclear.

I could potentially see it happening with large accidents, but I personally don't want to bet on that, and I would want it to be transparent if that's the assumption. I also don't buy the "small coordinations enable larger coordinations" argument for domain-specific policy. Beyond what you said above, my sense is that policymakers satisfice and hence tend not to revisit a policy that sucks as long as it looks sufficiently good to stakeholders that there's no substantial incentive to change it.

GDPR cookie banners suck for everyone and haven't been updated yet, 7 years after GDPR. Standards in the EU aren't updated more often than every 5 years by default (I'm talking about standards, not regulation), and we'll have to bargain to try to bring that down to reasonable, AI-specific timeframes. 

The IAEA & nuclear safety upgraded substantially after each accident, and likewise for aviation, but we're talking about decades, not 5 years.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T11:59:00.322Z · LW · GW

Thanks for your comment. 
 

I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done.

I strongly disagree with this. In my opinion, a lot of the issue is that RSPs have been thought up from first principles without much consideration for everything the risk management field has done, and hence get things wrong without noticing. 

It's not a matter of how detailed they are; they get the broad principles wrong. As I argued (the entire table is about this), I think the existing principles of other existing standards are just way better, so no, it's not a matter of details. 

As I said, the details & evals of RSPs are actually the one thing I'd keep and include in a risk management framework. 

Honestly I can't think of anything much better that could have been reasonably done given the limited time and resources we all have

Well, I recommend looking at Section 3 and the source links. Starting from those frameworks and including evals in them would be a Pareto improvement.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T11:47:21.420Z · LW · GW

Thanks for your comment. 

One issue is that everyone disagrees.

That's right and that's a consequence of uncertainty, which prevents us from bounding risks. Decreasing uncertainty (e.g. through modelling or through the ability to set bounds) is the objective of risk management.

Doses of radiation are quite predictable

I think that's mostly in hindsight. When you read stuff about nuclear safety from the 1970s, that's really not how things looked.

See Section 2

the arc of new technology is not [predictable]

I think that this sets a "technology is magic" vibe which is only valid for scaling neural nets (and probably only because we haven't invested that much into understanding scaling laws etc.), and not for most other technologies. We can actually develop technology where we know what it's doing before building it, and that's what we should aim for given what's at stake here.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T11:36:37.230Z · LW · GW

Thanks a lot for this constructive answer, I appreciate the engagement.

I'll agree that it would be nice if we knew how to do this, but we do not.
With our current level of understanding, we fall at the first hurdle (we can measure some of the risks).

Three points on that: 

  1. I agree that we're pretty bad at measuring risks. But I think that the combination of AI risk experts x forecasters x risk management experts is a very solid baseline, much more solid than not measuring the aggregate risk at all. 
  2. I think that we should do our best and measure conservatively, and that to the extent we're uncertain, that should be reflected in calibrated risk estimates. 
  3. I do expect the first few rounds of risk estimates to be overconfident, especially to the extent they include ML researchers' estimates. My sense from nuclear is that that's what happened there and that, failure after failure, the field got red-pilled. You can read more on this here (https://en.wikipedia.org/wiki/WASH-1400). 
    1. Related to that, I think it's key to provide as many risk-estimate feedback loops as possible by forecasting incidents, in order to red-pill the field faster on the fact that it is overconfident by default about risk levels. 

This implies an immediate stop to all frontier AI development (and probably a rollback of quite a few deployed systems). We don't understand. We cannot demonstrate risks are below acceptable levels.

It's more complicated than that, to the extent that you could probably still train code generation systems or other systems with a narrowed-down domain of operation; but I do think that for LLMs, risk levels would be too high to keep scaling fully general LLMs that can be plugged into tools etc. by >4 OOMs. 

I think that this would massively benefit systems we understand, which could plausibly reach significant levels of capability at some point in the future (https://arxiv.org/abs/2006.08381). It would probably lead labs to invest massively in that. 

Given our current levels of understanding, all a team of "experts" could do would be to figure out a lower bound on risk. I.e. "here are all the ways we understand that the system could go wrong, making the risk at least ...".

I agree that by default we're unable to upper-bound risks, and I think it's one additional failure of RSPs to act as if we were able to do so. The role of calibrated forecasters in the process is to help keep in mind the uncertainty arising from this.

 

 

Why is pushing for risk quantification in policy a bad idea?

[...]

However, since "We should stop immediately because we don't understand" can be said in under ten words, if any much more lengthy risk-management approach is proposed, the implicit assumption will be that it is possible to quantify the risk in a principled way. It is not.

Quantified risk estimates that are wrong are much worse than underdefined statements.

  1. I think it's a good point, and there should be explicit caveats to limit that, but they won't be enough.
  2. I think it's a fair concern for quantified risk assessment, and I expect it to be fairly likely that we fail in certain ways if we do only quantified risk assessment over the next few years. That's why I think we should not only do that but also deterministic safety analysis and scenario-based risk analysis, which you could think of as sanity checks to ensure you're not completely wrong in your quantified risk assessment.
  3. Reading your points, I think that one core feature you might be missing here is that uncertainty should be reflected in the quantified estimates if we get forecasters involved. Hence, I expect quantified risk assessment to reveal our lack of understanding rather than suffer from it by default. I still think that your point will partially hold, but much less than in a world where Anthropic dismisses accidental risks as speculative and says they're "unlikely" (which, as I say, could mean 1/1000, 1/100 or 1/10, but the lack of an explicit number makes the statement sound reasonable) without saying "oh btw we really don't understand our systems". 

Once again, thanks a lot for your comment!

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T09:39:03.654Z · LW · GW

A more in-depth answer: 

  1. The permanent motte-and-bailey that RSPs allow (easily defensible: a framework that seems arbitrarily extensible, combined with the belief that you can always change stuff in policy, even over a few-years timeframe; hardly defensible: the actual implementations & the communication around RSPs) is one of the concerns I raise explicitly, and it is what this comment is doing. Here, while I'm talking in large part about the ARC RSP principles, you say that I'm talking about "current RSPs". If the reply to anyone who criticizes the principles of RSPs is that we can change even the principles (and not only their application), then that's a pretty effective way to make something literally impossible to criticize. We could have taken an arbitrary framework, pushed for it, and said "we'll do better soon, we need wins". Claiming that we'll change the framework (not only the application) within 5 years is a very extraordinary claim and does not seem like a good reason to start pushing for a bad framework in the first place. 
  1. And it's not true: the "Safe Zone" in ARC's graph clearly suggests that ASL-3 measures are sufficient. Anthropic's announcement says "require safety, security, and operational standards appropriate to a model’s potential for catastrophic risk". It implies that ASL-3 measures are sufficient, without actually quantifying the risk (one of the core points of my post), even qualitatively.
  2. At a meta level, I find it frustrating that the most upvoted comment, your comment, is one that hasn't seriously engaged with the post, still makes a claim about the entire post, and doesn't address my request for evidence about the core crux (evidence of major framework changes within 5 years). If "extremely short timelines" means 5 years, it seems like many people have "extremely short timelines". 
Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T01:32:25.826Z · LW · GW

You can see this section which talks about the points you raise.

Comment by simeon_c (WayZ) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T01:17:56.113Z · LW · GW

Thanks for your comment. 

  1. I'm sorry for that magnitude of misunderstanding, and will try to clarify it upfront in the post, but a large part of my argument is about why the principles of RSPs are not good enough, rather than the specific implementation (which is also not sufficient though, and which I argue in "Overselling, underdelivering" is one of the flaws of the framework and not just a problem that will pass). 
    1. You can check Section 3 for why I think that the principles are flawed, and Section 1 and 2 to get a better sense of what better principles look like.
  2. Regarding the timeline, I think that it's unreasonable to expect major framework changes over less than 5 years. And as I wrote, if you think otherwise, I'd love to hear any example of that happening in the past and the conditions under which it happened. 
    1. I do think that within the RSP framework you can maybe do better, but as I argue in Section 3, I think the framework is fundamentally flawed and should be replaced by a standard risk management framework, in which we include evals.
Comment by simeon_c (WayZ) on Lying is Cowardice, not Strategy · 2023-10-25T00:07:54.181Z · LW · GW

A few other examples off the top of my head:

  • ARC's graph on RSPs with the "safe zone" part
  • Anthropic calling ASL-4 accidental risks "speculative"
  • the recent TIME article saying there's no trade-off between progress and safety

More generally, having talked to many people in AI policy/safety, I can say it's a very common pattern. On the eve of the FLI open letter, one of the most senior people in the AI governance & policy x-risk community was explaining that it was stupid to write this letter, that it would make future policy efforts much more difficult, etc.

Comment by simeon_c (WayZ) on Lying is Cowardice, not Strategy · 2023-10-24T14:51:05.786Z · LW · GW

I think it still makes sense to have a heuristic of the form "I should have a particularly high bar of confidence if I do something deontologically bad that happens to be good for me personally".

Comment by simeon_c (WayZ) on Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.) · 2023-10-15T19:09:44.618Z · LW · GW

Thanks a lot for writing that post.

One question I have regarding fast takeoff: don't you expect learning algorithms much more efficient than SGD to show up and greatly accelerate the rate of capabilities development?

One "overhang' I can see it the fact that humans have written a lot of what they know how to do all kinds of task on the internet and so a pretty data efficient algo could just leverage this and fairly suddenly learn a ton of tasks quite rapidly. For instance, in context learning is way more data efficient than SGD in pre-training. Right now it doesn't seem like in context learning is exploited nearly as much as it could be. If we manage to turn ~any SGD learning problem into an in-context learning problem, which IMO could happen with an efficient long term memory and a better long context length, things could accelerate pretty wildly. Do you think that even things like that (i.e. we unlock a more data efficient Algo which allows much faster capabilities development) will necessarily be smoothed?

Comment by simeon_c (WayZ) on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-20T11:14:20.164Z · LW · GW

Cool, thanks. 
I've seen that you've edited your post. If you look at ASL-3 Containment Measures, I'd recommend considering editing away the "Yay" as well. 
This post moves the goalposts pretty significantly. 

While my initial understanding was that autonomous replication would be a ceiling, this doc now makes it a floor. 

So in other words, this document proposes to keep going beyond levels that are considered potentially catastrophic, with less-than-military-grade cybersecurity, which makes it very likely that at least one state, and plausibly multiple states, will have access to those things. 

It also means that the chances of leaking a system which is irreversibly catastrophic are probably not below 0.1%, maybe not even below 1%. 


My interpretation of the excitement around the proposal is a feeling that "yay, it's better than where we were before". 
But I think it heavily neglects a few things. 
1. It's way worse than risk management 101, which is easy to push for.
2. The US population is pro-slowdown (so you can basically be way more ambitious than "responsibly scaling").
3. An increasing share of policymakers are worried.
4. Self-regulation has a track record of heavily affecting hard law (either by preventing it, or by creating a template that the state can enforce; that's the ToC I understood from people excited by self-regulation). For instance, I expect this proposal to actively harm efforts to push for ambitious slowdowns that would let us bring the probability of doom below double digits. 


For those reasons, I wish this doc didn't exist. 

Comment by simeon_c (WayZ) on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-19T22:15:11.625Z · LW · GW

Can you quote the parts you're referring to?

Comment by simeon_c (WayZ) on In the Short-Term, Why Couldn't You Just RLHF-out Instrumental Convergence? · 2023-09-16T11:35:04.450Z · LW · GW

I agree with this general intuition, thanks for sharing. 

 

I'd value descriptions of specific failures you'd expect from an LLM that we've tried to RLHF against "bad instrumental convergence" but where we fail, or a better sense of how you'd guess it would look on an LLM agent or a scaled GPT. 

Comment by simeon_c (WayZ) on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-08-02T22:19:17.231Z · LW · GW

I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I see. I guess where we might disagree is that IMO a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I think it's far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn't necessarily hurt the company) and empirically.

I'm not saying that it would be a force against regulation in general, but that it would be a force against any regulation which substantially slows down labs' current rate of capabilities progress. And the empirical evidence doesn't demonstrate the opposite, as far as I can tell. 

  • Labs have been pushing for the rule that we should wait for evals to say "it's dangerous" before we consider what to do, rather than doing as in most other industries, i.e. assuming something is dangerous until proven safe. 
  • Most mentions of a slowdown have described it as potentially necessary at some point in the distant future, while most people in those labs have <5y timelines.

Finally, on your conceptual point: as some have argued, it's in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less bad than slowing down only once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier. 

will comment that it seems like a big leap from "X product was released N months earlier than otherwise" to "Transformative AI will now arrive N months earlier than otherwise."

I agree, but if anything, my sense is that due to various compounding effects (AI accelerating AI, investment, increased compute demand, and more talent arriving earlier), an earlier product release of N months just gives a lower bound on the shortening of TAI timelines (hence greater than N). Moreover, I think that the ChatGPT product release is, ex post at least, not in the typical product release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.

Comment by simeon_c (WayZ) on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-07-15T16:15:23.471Z · LW · GW

Thanks for the clarifications. 

But is there another "decrease the race" or "don't make the race worse" intervention that you think can make a big difference? Based on the fact that you're talking about a single thing that can help massively, I don't think you are referring to "just don't make things worse"; what are you thinking of?

1. I think we agree on the fact that "unless it's provably safe" is the best version of trying to get a policy slowdown. 
2. I believe there are many interventions that could help on the slowdown side, most of which are unfortunately not compatible with being a successful, careful AI lab. The main struggle a successful, careful AI lab encounters is that it has to trade off tons of safety principles along the way, essentially because it needs to attract investors & talent, and attracting investors & talent is hard if you say too loudly that we should slow down as long as our thing is not provably safe.

So de facto a successful, careful AI lab will be a force against slowdown & a bunch of other relevant policies in the policy world. It will also add to the perceived race, which makes things harder for every actor. 

Other interventions for slowdown are mostly in the realm of public advocacy. 

Mostly drawing upon the animal welfare activism playbook, you could use public campaigns to de facto limit the ability of labs to race, via corporate or policy advocacy campaigns. 


I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of "heat" to be felt regardless of any one player's actions). And the potential benefits seem big. My rough impression is that you're confident the costs outweigh the benefits for nearly any imaginable version of this; if that's right, can you give some quantitative or other sense of how you get there?

I guess, heuristically, I tend to treat arguments of the form "but others would have done this bad thing anyway" with some skepticism, because I think they tend to assume too much certainty about the counterfactual, in part due to many second-order effects (e.g. the existence of one marginal key player increases the chances that more players invest, shows that competition is possible, etc.) that tend to be hard to compute (but are sometimes observable ex post).

On this specific case, I don't think it's right that there are "lots of players" close to the frontier. If we take the case of OA and Anthropic, for example, there are about 0 other players at their level of deployed capabilities. Maybe Google will deploy at some point, but they haven't been a serious player for the past 7 months. So if Anthropic hadn't been around, OA could have chilled longer at the ChatGPT level, and then at GPT-4 without plugins + code interpreter, without facing any threat. And now they'll need to do something very impressive to answer the 100k context etc. 

The compounding effects of this are pretty substantial, because each new differentiation accelerates the whole field and pressures teams to find something new, causing a significantly more powerful race to the bottom. 

If I had to be (vaguely) quantitative for the past 9 months, I'd guess that the existence of Anthropic has caused (/will cause, if we count the 100k thing) 2 significant counterfactual features and 3-5 months of timeline shortening (which will probably compound into more due to self-improvement effects). I'd guess there are other effects (e.g. pressure on compute, scaling to drive costs down, etc.) that I'm not able to give even vague estimates for. 

My guess for the 3-5 months is mostly driven by the releases of ChatGPT & GPT-4, which were both likely released earlier than they would have been without Anthropic.

Comment by simeon_c (WayZ) on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T00:30:43.174Z · LW · GW

So I guess first you condition on alignment being solved when we win the race. Why do you think OpenAI/Anthropic are very different from DeepMind? 

Comment by simeon_c (WayZ) on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-09T02:22:28.296Z · LW · GW

Thanks for writing that up. 

I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single thing could help massively". And this core consideration is also why I don't think the "successful, careful AI lab" framing is right. 

Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities and the chances of careless deployment, which pretty substantially increases the chances of extinction.

Comment by simeon_c (WayZ) on Launching Lightspeed Grants (Apply by July 6th) · 2023-06-07T03:18:09.706Z · LW · GW

Extremely excited to see this new funder. 
I'm pretty confident that we can indeed find a significant number of new donors for AI safety since the recent Overton window shift. 

Chatting with people with substantial networks, it seemed to me like a centralized non-profit fundraising effort could probably raise at least $10M. Happy to intro you to those people if relevant @habryka

And reducing the processing time is also very exciting. 

So thanks for launching this.

Comment by simeon_c (WayZ) on My Assessment of the Chinese AI Safety Community · 2023-04-26T13:33:38.221Z · LW · GW

Thanks for writing this.

Overall, I don't like the post much in its current form. There's ~0 evidence (e.g. from Chinese newspapers) and there is very little actual argumentation. I like that you give us a local view, but putting in a few links to back your claims would be very much appreciated. Right now it's hard to update on your post, given that the claims are very empirical and come without any external sources.

More minorly, regarding "A domestic regulation framework for nuclear power is not a strong signal for a willingness to engage in nuclear arms reduction": I also disagree with this statement. I think it's definitely a signal.

Comment by simeon_c (WayZ) on Deep learning models might be secretly (almost) linear · 2023-04-26T08:23:24.497Z · LW · GW

@beren in this post, we find that our method (Causal Direction Extraction) allows capturing a lot of the gender difference in 2 dimensions in a linearly separable way. Skimming that post might be of interest to you and your hypothesis. 

In the same post though, we suggest that it's unclear how well the logit lens "works", to the extent that the direction that best encodes a given concept likely changes by a small angle at each layer, which causes the two directions that best encode the same concept 15 layers apart to have a cosine similarity <0.5.
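To make the angle intuition concrete, here is a minimal numerical sketch (the hidden size and the per-layer angle are made-up illustrative values, not measured ones): a drift of only ~4° per layer keeps adjacent layers almost perfectly aligned while pushing the 15-layer cosine similarity down to ~0.5.

```python
# Illustrative only: a small, consistent per-layer rotation of a concept direction
# keeps neighbouring layers nearly aligned but compounds to cos(60 degrees) = 0.5 over 15 layers.
import numpy as np

dim = 768               # hypothetical hidden size
theta = np.deg2rad(4)   # hypothetical ~4 degree drift per layer

# Rotation in the plane spanned by the first two coordinates.
R = np.eye(dim)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]

v0 = np.zeros(dim)
v0[0] = 1.0             # concept direction at some layer

v = v0.copy()
for _ in range(15):     # drift across 15 layers
    v = R @ v

print("adjacent-layer cosine similarity:", round(float(np.cos(theta)), 3))  # ~0.998
print("cosine similarity 15 layers apart:", round(float(v0 @ v), 3))        # ~0.5
```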

But what seems plausible to me is that basically ~all of the information relevant to a feature is encoded in a very small number of directions, which are slightly different at each layer. 

Comment by simeon_c (WayZ) on The Agency Overhang · 2023-04-21T08:22:55.615Z · LW · GW

I'd add that this isn't an argument for making models agentic in the wild. It's just an argument for being worried already.

Comment by simeon_c (WayZ) on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2023-04-19T16:33:28.140Z · LW · GW

Thanks for writing that up Charbel & Gabin. Below are some elements I want to add.

Over the last 2 months, I've spent more than 20h with David, talking and interacting with his ideas and plans, especially in technical contexts. 
As I spent more time with David, I became extremely impressed by the breadth and depth of his knowledge. David has cached answers to a surprisingly high number of technically detailed questions about his agenda, which suggests that he has pre-computed a lot of things regarding it (even though it sometimes looks very weird at first sight). I noticed that I have never met anyone as smart as him. 

Regarding his ability to devise a high-level plan that works in practice, David has built a technically impressive cryptocurrency (today ranked 22nd) following a similar methodology, i.e. devising the plan from first principles. 

Finally, I'm excited by the fact that David seems to have a good ability to build ambitious coalitions with researchers, which is a great upside for governance and for such an ambitious proposal. Indeed, he has a strong track record of convincing researchers to work on his stuff after talking for a couple of hours, because he often has very good ideas about their field.

These elements, combined with my increasing worry that scaling LLMs at breakneck speed is not far from certain to kill us, make me want to back this proposal heavily and pour a lot of resources into it. 

I'll thus personally dedicate, in my own capacity, some time and resources to trying to speed that up, in the hope (10-20%) that in a couple of years it could become a credible alternative to scaled LLMs. 

Comment by simeon_c (WayZ) on AI Takeover Scenario with Scaled LLMs · 2023-04-17T07:43:51.410Z · LW · GW

I'll focus on 2 first, given that it's the most important. 2. I would expect sim2real to not be too hard for foundation models, because they're trained over massive distributions which allow and force them to generalize to near neighbours. E.g. I think that it wouldn't be too hard for an LLM to generalize some knowledge from stories to real life if it had an external memory, for instance. I'm not certain, but I feel like robotics is more sensitive to details than plans are (which is why I'm mentioning a simulation here). Finally, regarding long horizons, I agree that it seems hard, but I worry that at current capability levels you can already build ~any reward model, because LLMs, given enough inferences, seem generally very capable at evaluating stuff.

  1. I agree that it's not something which is very likely. But I disagree that "nobody would do that". People would do that if it were useful.

  2. I've asked some ML engineers, and it does happen that you don't look at it for a day. I don't think that deploying it in the real world changes much. Once again, you're also assuming a pretty advanced form of security mindset.

Comment by simeon_c (WayZ) on Navigating AI Risks (NAIR) #1: Slowing Down AI · 2023-04-16T20:52:40.096Z · LW · GW

Yes, I definitely think that countries with strong deontological norms will try harder to solve some narrow versions of alignment than those that tolerate failures. 

I think that's quite reassuring, and it means it's quite reasonable to focus on the US a lot in our governance approaches.

Comment by simeon_c (WayZ) on Campaign for AI Safety: Please join me · 2023-04-15T13:15:34.515Z · LW · GW

I think it's misleading to state it that way. There were definitely dinners and discussions with people around the creation of OpenAI. 
https://timelines.issarice.com/wiki/Timeline_of_OpenAI 
Months before the creation of OpenAI, there was a discussion about starting OpenAI that included Chris Olah, Paul Christiano, and Dario Amodei: "Sam Altman sets up a dinner in Menlo Park, California to talk about starting an organization to do AI research. Attendees include Greg Brockman, Dario Amodei, Chris Olah, Paul Christiano, Ilya Sutskever, and Elon Musk."

Comment by simeon_c (WayZ) on You are probably not a good alignment researcher, and other blatant lies · 2023-02-09T07:03:48.030Z · LW · GW

Also, I think that it's fine to have lower chances of being an excellent alignment researcher for that reason. What matters is having impact, not being an excellent alignment researcher. E.g. I don't go all-in on a technical career myself essentially for that reason, combined with the fact that I have other features that might allow me to go further out in the impact tail in other relevant subareas. 

Comment by simeon_c (WayZ) on You are probably not a good alignment researcher, and other blatant lies · 2023-02-09T06:59:44.539Z · LW · GW

If I try to think about someone's IQ (which I don't normally do, except for the sake of the message above, where I tried to think of a specific number to make my claim precise), I feel like I can form an ordering where I'm not too uncertain, on a scale that includes me, some common reference classes (e.g. the median student of school X has IQ Y), and a few people around me who have taken IQ tests. By the way, I'd be happy to bet on anyone (e.g. from the list of SERI MATS mentors), if someone agreed to reveal their IQ, if you think my claim is wrong. 

Comment by simeon_c (WayZ) on You are probably not a good alignment researcher, and other blatant lies · 2023-02-03T05:32:47.528Z · LW · GW

Thanks for writing that. 

Three thoughts that come to mind: 

  • I feel like a more accurate claim is something like "beyond a certain IQ, we don't know what makes a good alignment researcher", which I think is a substantially weaker claim than the one underlying your post. I also think that the fact that the probability of being a good alignment researcher increases with IQ is relevant, if true (and I think it's very likely true, as in most sciences, where Nobel laureates are usually outliers along that axis). 
  • I also feel like I would expect predictors from other research fields to roughly apply (e.g. conscientiousness). 
  • In this post you don't cover what seems to be the most important part of why people sometimes give advice of the form "it seems like, given features X and Y, you're more likely to be able to fruitfully contribute to Z" (which seems adjacent to the claims you're criticizing), i.e. the person's opportunity cost.