Posts

Scenario Forecasting Workshop: Materials and Learnings 2024-03-08T02:30:46.517Z
Forecasting future gains due to post-training enhancements 2024-03-08T02:11:57.228Z
Discussing how to align Transformative AI if it’s developed very soon 2022-11-28T16:17:54.390Z
Eli's review of "Is power-seeking AI an existential risk?" 2022-09-30T12:21:19.465Z
Quantified Intuitions: An epistemics training website including a new EA-themed calibration app 2022-09-20T22:25:44.973Z
Prize idea: Transmit MIRI and Eliezer's worldviews 2022-09-19T21:21:13.156Z
Forecasting thread: How does AI risk level vary based on timelines? 2022-09-14T23:56:30.341Z
Samotsvety's AI risk forecasts 2022-09-09T04:01:18.958Z
(My understanding of) What Everyone in Technical Alignment is Doing and Why 2022-08-29T01:23:58.073Z
Discussion on utilizing AI for alignment 2022-08-23T02:36:25.207Z
Personal forecasting retrospective: 2020-2022 2022-07-21T00:07:03.511Z
Impactful Forecasting Prize for forecast writeups on curated Metaculus questions 2022-02-04T20:06:16.408Z
elifland's Shortform 2022-01-11T16:35:04.019Z
My Hypermind Arising Intelligence Forecasts and Reflections 2021-09-26T20:47:28.267Z

Comments

Comment by elifland on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T22:26:28.568Z · LW · GW

Thanks. I edited again to be more precise. Maybe I'm closer to the median than I thought.

(edit: unimportant clarification. I just realized "you all" may have made it sound like I thought every single person on the Lightcone team was higher than my p(doom). I meant it to be more like a generic y'all to represent the group, not a claim about the minimum p(doom) of the team)

Comment by elifland on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T22:21:42.720Z · LW · GW

Yeah I meant more on p(doom)/alignment difficulty than timelines, I'm not sure what your guys' timelines are. I'm roughly in the 35-55% ballpark for a misaligned takeover, and my impression is that you all are closer to but not necessarily all the way at the >90% Eliezer view. If that's also wrong I'll edit to correct.

edit: oh maybe my wording of "farther" in the original comment was specifically confusing and made it sound like I was talking about timelines. I will edit to clarify.

Comment by elifland on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T21:43:35.903Z · LW · GW

Appreciate the post. I've previously donated $600 through the EA Manifund thing and will consider donating again late this year / early next year when thinking through donations more broadly.

I've derived lots of value with regard to thinking through AI futures from LW/AIAF content (some non-exhaustive standouts: 2021 MIRI conversations, List of Lethalities and Paul response, t-AGI framework, Without specific countermeasures..., Hero Licensing). It's unclear to me how much of the value would have been retained if LW didn't exist, but plausibly LW is responsible for a large fraction.

In a few ways I feel not fully/spiritually aligned with the LW team and the rationalist community: my alignment difficulty/p(doom)[1] is farther from Eliezer's[2] than my perception of the median of the LW team[3] (though closer to Eliezer than most EAs), I haven't felt sucked in by most of Eliezer's writing, and I feel gut-level cynical about people's ability to deliberately improve their rationality (edit: with large effect size) (I haven't spent a long time examining evidence to decide whether I really believe this).

But still LW has probably made a large positive difference in my life, and I'm very thankful. I've also enjoyed Lighthaven, but I have to admit I'm not very observant and opinionated on conference venues (or web design, which is why I focused on LW's content).

  1. ^

    Previously just said "AI forecasts", edited to make more specific the view that I'm talking about.

  2. ^

    Previously said MIRI. Edited MIRI -> Eliezer since MIRI has somewhat heterogeneous views

  3. ^

    Previously just said "LW team", added "the median of" to better represent heterogeneity

Comment by elifland on things that confuse me about the current AI market. · 2024-08-29T16:19:56.007Z · LW · GW

Twitter AI (xAI), which seemingly had no prior history of strong AI engineering, with a small team and limited resources

Both of these seem false.

Re: talent, see from their website:

They don't list their team on their site, but I know their early team includes Igor Babuschkin who has worked at OAI and DeepMind, and Christian Szegedy who has 250k+ citations including several foundational papers.

Re: resources, according to Elon's early July tweet (ofc take Elon with a grain of salt) Grok 2 was trained on 24k H100s (approximately 3x the FLOP/s of GPT-4, according to SemiAnalysis). And xAI was working on a 100k H100 cluster that was on track to be finished in July. Also they raised $6B in May.

Comment by elifland on Habryka's Shortform Feed · 2024-07-04T19:33:25.368Z · LW · GW

And internally, we have an anonymous RSP non-compliance reporting line so that any employee can raise concerns about issues like this without any fear of retaliation.
 

Are you able to elaborate on how this works? Are there any other details about this publicly? I couldn't find more via a quick search.

Some specific qs I'm curious about: (a) who handles the anonymous complaints, (b) what is the scope of behavior explicitly (and implicitly re: cultural norms) covered here, (c) how are situations handled where a report would deanonymize the reporter (or narrow them down to a small number of people)?

Comment by elifland on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T18:03:07.918Z · LW · GW

Thanks for the response!

I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule.

[...]

I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round

Thanks for these clarifications. I didn't realize that the 30% was for the new yellow-line evals rather than the current ones.

Since triggering a yellow-line eval requires pausing until we have either safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals

I'm having trouble parsing this sentence. What do you mean by "doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals"? Doesn't pausing include focusing on mitigations and evals?

Comment by elifland on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T05:26:03.364Z · LW · GW

From the RSP Evals report:

As a rough attempt at quantifying the elicitation gap, teams informally estimated that, given an additional three months of elicitation improvements and no additional pretraining, there is a roughly 30% chance that the model passes our current ARA Yellow Line, a 30% chance it passes at least one of our CBRN Yellow Lines, and a 5% chance it crosses cyber Yellow Lines. That said, we are currently iterating on our threat models and Yellow Lines so these exact thresholds are likely to change the next time we update our Responsible Scaling Policy.

What's the minimum X% that could replace 30% and would be treated the same as passing the yellow line immediately, if any? If you think that there's an X% chance that with 3 more months of elicitation, a yellow line will be crossed, what's the decision-making process for determining whether you should treat it as already being crossed?

In the RSP it says "It is important that we are evaluating models with close to our best capabilities elicitation techniques, to avoid underestimating the capabilities it would be possible for a malicious actor to elicit if the model were stolen" so it seems like folding in some forecasted elicited capabilities into the current evaluation would be reasonable (though they should definitely be discounted the further out they are).

(I'm not particularly concerned about catastrophic risk from the Claude 3 model family, but I am interested in the general policy here and the reasoning behind it)

Comment by elifland on elifland's Shortform · 2024-05-14T04:40:58.511Z · LW · GW

The word "overconfident" seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:

  1. They gave a binary probability that is too far from 50% (I believe this is the original one)
  2. They overestimated a binary probability (e.g. they said 20% when it should be 1%)
  3. Their estimate is arrogant (e.g. they say there's a 40% chance their startup fails when it should be 95%), or maybe they give an arrogant vibe
  4. They seem too unwilling to change their mind upon arguments (maybe their credal resilience is too high)
  5. They gave a probability distribution that seems wrong in some way (e.g. "50% AGI by 2030 is so overconfident, I think it should be 10%")
    • This one is pernicious in that any probability distribution gives very low percentages for some range, so being specific here seems important.
  6. Their binary estimate or probability distribution seems too different from some sort of base rate, reference class, or expert(s) that they should defer to.

How much does this overloading matter? I'm not sure, but one worry is that it allows people to score cheap rhetorical points by claiming someone else is overconfident when in practice they might mean something like "your probability distribution is wrong in some way". Beware of accusing someone of overconfidence without being more specific about what you mean.

Comment by elifland on Habryka's Shortform Feed · 2024-05-04T01:02:49.115Z · LW · GW

I think you'd need 356 or more people in the population for there to be a >5% chance of 2+ deaths in a 2-month span from that population.
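
The comment above doesn't state the mortality rate behind the 356 figure, so the following is only a sketch of how such a number could be derived. It treats deaths as independent with a per-person probability `p` of dying within the 2-month window (a binomial model) and searches for the smallest population where the chance of 2+ deaths exceeds 5%. The value `ASSUMED_P = 0.001` is a hypothetical input, chosen because under this model it gives a threshold of 356; the rate actually used in the original estimate is not given.

```python
def prob_at_least_two(n: int, p: float) -> float:
    """P(X >= 2) for X ~ Binomial(n, p): at least two deaths among n people."""
    p_zero = (1 - p) ** n                 # nobody dies
    p_one = n * p * (1 - p) ** (n - 1)    # exactly one person dies
    return 1 - p_zero - p_one


def min_population(p: float, threshold: float = 0.05) -> int:
    """Smallest n for which P(2+ deaths) strictly exceeds the threshold."""
    n = 2  # two deaths are impossible with fewer than two people
    while prob_at_least_two(n, p) <= threshold:
        n += 1
    return n


# ASSUMPTION (not from the original comment): 0.1% chance of death
# per person over the 2-month window.
ASSUMED_P = 0.001
print(min_population(ASSUMED_P))
```

The answer is quite sensitive to `p`: doubling the assumed rate roughly halves the required population, so the 356 figure only means something alongside the mortality assumption behind it.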

Comment by elifland on elifland's Shortform · 2024-03-13T19:55:03.575Z · LW · GW

[cross-posting from blog]

I made a spreadsheet for forecasting the 10th/50th/90th percentile for how you think GPT-4.5 will do on various benchmarks (given 6 months after the release to allow for actually being applied to the benchmark, and post-training enhancements). Copy it here to register your forecasts.

If you’d prefer, you could also use it to predict for GPT-5, or for the state-of-the-art at a certain time e.g. end of 2024 (my predictions would be pretty similar for GPT-4.5, and end of 2024).

You can see my forecasts made with ~2 hours of total effort on Feb 17 in this sheet; I won’t describe them further here in order to avoid anchoring.

There might be a similar tournament on Metaculus soon, but not sure on the timeline for that (and spreadsheet might be lower friction). If someone wants to take the time to make a form for predicting, tracking and resolving the forecasts, be my guest and I’ll link it here.

Comment by elifland on Scenario Forecasting Workshop: Materials and Learnings · 2024-03-09T17:52:25.384Z · LW · GW

This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs).

 

FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.

Comment by elifland on The World in 2029 · 2024-03-03T18:58:06.081Z · LW · GW

Interesting, thanks for clarifying. It's not clear to me that this is the right primary frame to think about what would happen, as opposed to just thinking first about how big compute bottlenecks are and then adjusting the research pace for that (and then accounting for diminishing returns to more research). 

I think a combination of both perspectives is best, as the argument in your favor for your frame is that there will be some low-hanging fruit from changing your workflow to adapt to the new cognitive labor.

Comment by elifland on The World in 2029 · 2024-03-03T18:46:27.689Z · LW · GW

Physical bottlenecks still exist, but is it really that implausible that the capabilities workforce would stumble upon huge algorithmic efficiency improvements? Recall that current algorithms are much less efficient than the human brain. There's lots of room to go.

I don't understand the reasoning here. It seems like you're saying "Well, there might be compute bottlenecks, but we have so much room left to go in algorithmic improvements!" But the room to improve point is already the case right now, and seems orthogonal to the compute bottlenecks point.

E.g. if compute bottlenecks are theoretically enough to turn the 5x cognitive labor into only 1.1x overall research productivity, it will still be the case that there is lots of room for improvement but the point doesn't really matter as research productivity hasn't sped up much. So to argue that the situation has changed dramatically you need to argue something about how big of a deal the compute bottlenecks will in fact be.
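
To make the bottleneck point concrete, here is a minimal sketch using a generic CES (constant elasticity of substitution) production function with cognitive labor and compute as inputs. It is not the parameterization from Tom Davidson's report; the `labor_share` and `rho` values are hypothetical and chosen only to show how the same 5x increase in cognitive labor can translate into very different research speedups depending on how strongly compute bottlenecks bind.

```python
def ces_output(labor: float, compute: float, labor_share: float, rho: float) -> float:
    """Generic CES aggregate: (a * L^rho + (1 - a) * C^rho)^(1 / rho).

    rho must be nonzero. Negative rho means labor and compute are
    complements; the more negative, the harder compute bottlenecks bind.
    """
    return (labor_share * labor ** rho + (1 - labor_share) * compute ** rho) ** (1 / rho)


def research_speedup(labor_multiplier: float, labor_share: float, rho: float) -> float:
    """Output ratio when cognitive labor is scaled up and compute is held fixed."""
    baseline = ces_output(1.0, 1.0, labor_share, rho)
    boosted = ces_output(labor_multiplier, 1.0, labor_share, rho)
    return boosted / baseline


# Hypothetical parameters: equal shares and three levels of complementarity.
for rho in (-0.5, -2.0, -10.0):
    print(f"rho={rho}: speedup from 5x labor = {research_speedup(5.0, 0.5, rho):.2f}x")
```

In this sketch the room for algorithmic improvement never appears as a parameter: the speedup is determined entirely by how substitutable labor is for compute, which is why the size of the bottleneck is the thing that needs to be argued.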

Comment by elifland on The World in 2029 · 2024-03-03T18:37:21.584Z · LW · GW

Imagine the current AGI capabilities employee's typical work day. Now imagine they had an army of AI assistants that can very quickly do 10 hours worth of their own labor. How much more productive is that employee compared to their current state? I'd guess at least 5x. See section 6 of Tom Davidson's takeoff speeds framework for a model.

Can you elaborate how you're translating 10-hour AI assistants into a 5x speedup using Tom's CES model?

Comment by elifland on The World in 2029 · 2024-03-03T18:35:42.766Z · LW · GW

I agree that <15% seems too low for most reasonable definitions of 1-10 hours and the singularity. But I'd guess I'm more sympathetic than you, depending on the definitions Nathan had in mind.

I think both of the phrases "AI capable of doing tasks that took 1-10 hours" and "hit the singularity" are underdefined, and making them clearer could lead to significantly different probabilities here.

  1. For "capable of doing tasks that took 1-10 hours in 2024":
    1. If we're saying that "AI can do every cognitive task that takes a human 1-10 hours in 2024 as well as (edit: the best) a human expert", I agree it's pretty clear we're getting extremely fast progress at that point, not least because AI will be able to do the vast majority of tasks that take much longer than that by the time it can do all of the 1-10 hour tasks.
    2. However, if we're using a weaker definition like the one Richard used ("on most cognitive tasks, it beats most human experts who are given 1-10 hours to perform the task"), I think it's much less clear due to human interaction bottlenecks.
    3. Also, it seems like the distribution of relevant cognitive tasks that you care about changes a lot on different time horizons, which further complicates things.
  2. Re: "hit the singularity", I think in general there's little agreement on a good definition here. E.g. the definition in Tom's report is based on the doubling time of "effective compute in 2022-FLOP" shortening after "full automation", and I think it's unclear what that corresponds to in terms of real-world impact, as both of these terms are also underdefined/hard to translate into actual capability and impact metrics.

I would be curious to hear the definitions you and Nathan had in mind regarding these terms.

Comment by elifland on Retirement Accounts and Short Timelines · 2024-02-20T19:08:52.419Z · LW · GW
Comment by elifland on 2023 in AI predictions · 2024-01-02T14:41:19.072Z · LW · GW

In his AI Insight Forum statement, Andrew Ng puts 1% on "This rogue AI system gains the ability (perhaps access to nuclear weapons, or skill at manipulating people into using such weapons) to wipe out humanity" in the next 100 years (conditional on the existence of a rogue AI system that doesn't go unchecked by other AI systems). Overall, he puts 1 in 10 million on AI causing extinction in the next 100 years.

Comment by elifland on Vote on Interesting Disagreements · 2023-11-08T04:38:39.617Z · LW · GW

Among existing alignment research agendas/projects, Superalignment has the highest expected value

Comment by elifland on My thoughts on the social response to AI risk · 2023-11-02T22:18:08.529Z · LW · GW

I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum.

I had the impression that it was more than just that, given the line: "In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention." and the further attention devoted to deceptive alignment.

I appreciate these predictions, but I am not as interested in predicting personal or public opinions. I'm more interested in predicting regulatory stringency, quality, and scope

If you have any you think faithfully represent a possible disagreement between us go ahead. I personally feel it will be very hard to operationalize objective stuff about policies in a satisfying way. For example, a big issue with the market you've made is that it is about what will happen in the world, not what will happen without intervention from AI x-risk people. Furthermore it has all the usual issues with forecasting on complex things 12 years in advance, regarding the extent to which it operationalizes any disagreement well (I've bet yes on it, but think it's likely that evaluating and fixing deceptive alignment will remain mostly unsolved in 2035 conditional on no superintelligence, especially if there were no intervention from x-risk people).

Comment by elifland on My thoughts on the social response to AI risk · 2023-11-02T21:46:58.519Z · LW · GW

I have three things to say here:

Thanks for clarifying.

Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.

Don't have a strong opinion here, but intuitively feels like it would be hard to find tractable angles for work on this now.

I mostly think people should think harder about what the hard parts of AI risk are in the first place. It would not be surprising if the "hard bits" will be things that we've barely thought about, or are hard to perceive as major problems, since their relative hiddenness would be a strong reason to believe that they will not be solved by default.

Maybe. In general, I'm excited about people who have the talent for it to think about previously neglected angles.

The problem of "make sure policies are well-targeted, informed by the best evidence, and mindful of social/political difficulties" seems like a hard problem that societies have frequently failed to get right historically, and the relative value of solving this problem seems to get higher as you become more optimistic about the technical problems being solved.

I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one. This is the phrase that I was hoping would get made more concrete.

I want to emphasize that the current policies were crafted in an environment in which AI still has a tiny impact on the world. My expectation is that policies will get much stricter as AI becomes a larger part of our life. I am not making the claim that current policies are sufficient; instead I am making a claim about the trajectory, i.e. how well we should expect society to respond at a time, given the evidence and level of AI capabilities at that time.

I understand this (sorry if I wasn't clear), but I think it's less obvious than you do that this trend will continue without intervention from AI x-risk people. I agree with other commenters that AI x-risk people should get a lot of the credit for the recent push. I also provided example reasons that the trend might not continue smoothly or even reverse in my point (3).

There might also be disagreements around:

  1. Not sharing your high confidence in slow, continuous takeoff.
  2. The strictness of regulation needed to make a dent in AI risk, e.g. if substantial international coordination is required it seems optimistic to me to assume that the trajectory will by default lead to this.
  3. The value in things getting done faster than they would have done otherwise, even if they would have been done either way. This indirectly provides more time to iterate and get to better, more nuanced policy.

I believe that current evidence supports my interpretation of our general trajectory, but I'm happy to hear someone explain why they disagree and highlight concrete predictions that could serve to operationalize this disagreement.

Operationalizing disagreements well is hard and time-consuming especially when we're betting on "how things would go without intervention from a community that is intervening a lot", but a few very rough forecasts, all conditional on no TAI before resolve date:

  1. 75%: In Jan 2028, less than 10% of Americans will consider AI the most important problem.
  2. 60%: In Jan 2030, Evan Hubinger will believe that if x-risk-motivated people had not worked on deceptive alignment at all, risk from deceptive alignment would be at least 50% higher, compared to a baseline of no work at all (i.e. if risk is 5% and it would be 9% with no work from anyone, it needs to have been >7% if no work from x-risk people had been done to resolve yes).
  3. 35%: In Jan 2028, conditional on a Republican President being elected in 2024, regulations on AI in the US will be generally less stringent than they were when the previous president left office. Edit: Crossed out because not operationalized well, more want to get at the vibe of how strict the President and legislature are being on AI, and since my understanding is a lot of the stuff from the EO might not come into actual force for a while.
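
The arithmetic in the worked example for forecast 2 can be written out as a short sketch. This follows one reading of that example: the question resolves yes if the counterfactual risk without x-risk people's work closes at least half of the gap between the actual risk and the no-work-at-all baseline. The function and parameter names are hypothetical.

```python
def resolution_threshold(actual_risk: float, no_work_risk: float, fraction: float = 0.5) -> float:
    """Counterfactual risk level that must be exceeded for a 'yes' resolution.

    actual_risk:  risk given the work that was actually done
    no_work_risk: risk had nobody worked on the problem at all
    fraction:     share of the gap attributed to x-risk-motivated work
    """
    return actual_risk + fraction * (no_work_risk - actual_risk)


# Numbers from the example: 5% actual risk, 9% with no work from anyone.
threshold = resolution_threshold(0.05, 0.09)
print(f"Resolves yes if risk without x-risk people's work would exceed {threshold:.0%}")
```

With these inputs the gap is 4 percentage points, half of it is 2, and the threshold lands at the 7% stated in the example.
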

Comment by elifland on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T19:04:15.917Z · LW · GW

Thus, due to no one's intent, insufficiently justified concerns about current open-source AI are propagated to governance orgs, which recommend banning open source based on this research.

The recommendation that current open-source models should be banned is not present in the policy paper being discussed, AFAICT. The paper's recommendations are pictured below:

Edited to add: there is a specific footnote that says "Note that we do not claim that existing models are already too risky. We also do not make any predictions about how risky the next generation of models will be. Our claim is that developers need to assess the risks and be willing to not open-source a model if the risks outweigh the benefits" on page 31

Comment by elifland on My thoughts on the social response to AI risk · 2023-11-01T23:56:35.559Z · LW · GW

I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.

Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.

  1. Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones
    1. You argue that we perhaps shouldn't invest as much in preventing deceptive alignment because "regulators will likely adapt, adjusting policy as the difficulty of the problem becomes clearer"
    2. If we are assuming that regulators will adapt and adjust regarding deception, can you provide examples of interventions that policymakers will not be able to solve themselves and why they will be less likely to notice and deal with them than deception?
    3. You say "we should question how plausible it is that society will fail to adequately address such an integral part of the problem". What things aren't integral parts of the problem but that should be worked on?
      1. I feel we would need much better evidence of things being handled competently to invest significantly less into integral parts of the problem.
  2. You say: 'Of course, it may still be true that AI deception is an extremely hard problem that reliably resists almost all attempted solutions in any “normal” regulatory regime, even as concrete evidence continues to accumulate about its difficulty—although I consider that claim unproven, to say the least'
    1. If we expect some problems in AI risk to be solved by default mostly by people outside the community, it feels to me like one takeaway would be that we should shift resources to portions of the problem that we expect to be the hardest
    2. To me, intuitively, deceptive alignment might be one of the hardest parts of the problem as we scale to very superhuman systems, even if we condition on having time to build model organisms of misalignment and experiment with them for a few years. So I feel confused about why you claim a high level of difficulty is "unproven" as a dismissal; of course it's unproven but you would need to argue that in worlds where the AI risk problem is fairly hard, there's not much of a chance of it being very hard.
    3. As someone who is relatively optimistic about concrete evidence of deceptive alignment increasing substantially before a potential takeover, I think I still put significantly lower probability on it than you do due to the possibility of fairly fast takeoff.
  3. I feel like this post is to some extent counting our chickens before they hatch (tbc I agree with the directional update as I said above). I'm not an expert on what's going on here, but I can imagine any of the following happening (non-exhaustive list), each of which would make the current path to potentially sensible regulation in the US and internationally harder:
    1. The EO doesn't lead to as many resources dedicated to AI-x-risk-reducing things as we might hope. I haven't read it myself, just the fact sheet and Zvi's summary but Zvi says "If you were hoping for or worried about potential direct or more substantive action, then the opposite applies – there is very little here in the way of concrete action, only the foundation for potential future action."
    2. A Republican President comes to power in the US and reverses a lot of the effects of the EO
    3. Rishi Sunak gets voted out in the UK (my sense is that this is likely) and the new Prime Minister is much less gung-ho on AI risk
  4. I don't have strong views on the value of AI advocacy, but this post seems overconfident in calling it out as being basically not useful based on recent shifts.
    1. It seems likely that much stronger regulations will be important, e.g. the model reporting threshold in the EO was set relatively high and many in the AI risk community have voiced support for an international pause if it were politically feasible, which the EO is far from.
    2. The public still doesn't consider AI risk to be very important. <1% of the American public considers it the most important problem to deal with. So to the extent that raising that number was good before, it still seems pretty good now, even if slightly worse.

Comment by elifland on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-30T22:13:41.169Z · LW · GW

Oh, I'm certainly not claiming that no-one should attempt to make the estimates.

Ah my bad if I lost the thread there

I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get a local minimum (and, as ever, overconfidence).

Seems like checks on unsafe strategies, done well, encourage safer strategies; I agree overconfidence is an issue though

More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]

Seems true in an ideal world, but in practice I'd imagine it's much easier to get consensus when you have more concrete evidence of danger / misalignment. Seems like there's lots of disagreement even within the current alignment field, and I don't expect that to change absent more evidence of danger/misalignment and perhaps credible estimates.

To be clear I think if we could push a button for an international pause now it would be great, and I think it's good to advocate for that to shift the Overton Window if nothing else, but in terms of realistic plans it seems good to aim for stuff a bit closer to evaluating the next step than overall policies, for which there is massive disagreement.

(of course there's a continuum between just looking at the next step and the overall plan, there totally should be people doing both and there are so it's a question at the margin, etc.)

The other portions of your comment I think I've already given my thoughts on previously, but overall I'd say I continue to think it depends a lot on the particulars of the regulation and the group doing the risk assessment; done well I think it could set up incentives well but yes if done poorly it will get Goodharted. Anyway, I'm not sure it's particularly likely to get enshrined into regulation anytime soon, so hopefully we will get some evidence as to how feasible it is and how it's perceived via pilots and go from there.

Comment by elifland on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-30T18:34:55.633Z · LW · GW

GPT-4 + unknown unknowns = stop. (whether they say "unknown unknowns so 5% chance of 8 billion deaths", or "unknown unknowns so 0.1% chance of 8 billion deaths")

I feel like .1% vs. 5% might matter a lot, particularly if we don't have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. without strong international coordination where US/China/etc. trust each other to stop and we can verify that), so building capacity to improve these estimates seems good. I agree there are also tradeoffs around alignment research assistance that seem relevant. Anyway, overall I'd be surprised if it doesn't help substantially to have more granular estimates.

my worry isn't that it's not persuasive that time. It's that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we'll be screwed.

This seems to me to be assuming a somewhat simplistic methodology for the risk assessment; again this seems to come down to how good the team will be, which I agree would be a very important factor.

Comment by elifland on Thoughts on responsible scaling policies and regulation · 2023-10-28T03:43:55.813Z · LW · GW

Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x).

I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).

Comment by elifland on Thoughts on responsible scaling policies and regulation · 2023-10-27T21:32:42.229Z · LW · GW

I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much

This seems to be our biggest crux; as I said, I'd be interested in analyses of the alignment difficulty distribution if any onlookers know of them. Also, a semantic point, but under my current views I'd view cutting ~5% of the risk as a huge deal that's at least an ~80th percentile outcome for the AI risk community if it had a significant counterfactual impact on it, but yes, not much compared to 10x.

[EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers but maybe ~20% seems reasonable to me now]

Quick thoughts on the less cruxy stuff:

You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.

Fair, though I think 90% would be too low, and the more you raise it the longer you have to maintain the pause.

(based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium

This might coincidentally be close to the 95th percentile I had in mind.

So at that point you obviously aren't talking about 100% of countries voluntarily joining

Fair, I think I was wrong on that point. (I still think it's likely there would be various other difficulties with enforcing either RSPs or a moratorium for an extended period of time, but I'm open to changing my mind)

I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good")

Sorry if I wasn't clear: my worry is that open-source models will get better over time due to new post-training enhancements, not about their capabilities upon release.

Comment by elifland on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-27T17:15:26.455Z · LW · GW

Thanks! I agree with a lot of this, will pull out the 2 sentences I most disagree with. For what it's worth I'm not confident that this type of risk assessment would be a very valuable idea (/ which versions would be best). I agree that there is significant risk of non-cautious people doing this poorly.

The reason I'm against it is that it'll actually be rough and unprincipled, but this will not clear - in most people's minds (including most regulators, I imagine) it'll map onto the kind of systems that we have for e.g. nuclear risks.

I think quantifying "rough and unprincipled" estimates is often good, and if the team of forecasters/experts is good then in the cases where we have no idea what is going on yet (as you mentioned, the x-risk case especially as systems get stronger) they will not produce super confident estimates. If the forecasts were made by the same people and presented in the exact same manner as nuclear forecasts, that would be bad, but that's not what I expect to happen if the team is competent.

The first makes this kind of approach unnecessary - better to get the cautious people make the case that we have no solid basis to make these assessments that isn't a wild guess.

I'd guess something like "expert team estimates a 1% chance of OpenAI's next model causing over 100 million deaths, causing 1 million deaths in expectation" might hit harder with policymakers than "experts say we have no idea whether OpenAI's models will cause a catastrophe". The former seems to sound an alarm more clearly than the latter. This is definitely not my area of expertise though.

Comment by elifland on Thoughts on responsible scaling policies and regulation · 2023-10-27T17:02:35.360Z · LW · GW

I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.

Appreciate this clarification.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

(but conditioning on a very good implementation)

I'm still confused about the definition of "very good RSPs" and "very good implementation" here. If the evals/mitigations are defined and implemented in some theoretically perfect way by all developers of course it will lead to drastically reduced risk, but "very good" has a lot of ambiguity. I was taking it to mean something like "~95th percentile of the range of RSPs we could realistically hope to achieve before doom", but you may have meant something different. It's still very hard for me to see how under the definition I've laid out we could get to a 10x reduction. Even just priors on how large the effect sizes of interventions tend to be seem to bring it under 10x unless more detailed arguments are given for 10x, but I'll give some more specific thoughts below.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

In terms of "mistakes in evals" I don't think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren't just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.

I agree directionally with all of the claims you are making, but (a) I'd guess I have much less confidence than you that even applying very large amounts of effort / accumulated knowledge we will be able to reliably classify a system as safe or not (especially once it is getting close to and above human-level) and (b) even if we could do this reliably after several years, if you have to do a many-year pause there are various other sources of risk, like countries refusing to join / pulling out of the pause and risks from open-source models, including continued improvements via fine-tuning/scaffolding/etc.

I guess I don't think situations will be that "normal-ish" in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between "low effort" and "high effort" which is actually quite large. I also think there very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible

Yeah normal-ish was a bad way to put it. I'm skeptical that 3 marginal OOMs is significantly more than ~5% probability to tip the scales but this is just intuition (if anyone knows of projects on the distribution of alignment difficulty, would be curious). I agree that automating alignment is important and that's where a lot of my hope comes from.

[EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers but maybe ~20% seems reasonable to me now]

[edited to remove something that was clarified in another comment]

Comment by elifland on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T13:13:20.444Z · LW · GW

I’d be curious whether you think that it has been a good thing for Dario Amodei to publicly state his AI x-risk estimate of 10-25%, even though it’s very rough and unprincipled. If so, would it be good for labs to state a very rough estimate explicitly for catastrophic risk in the next 2 years, to inform policymakers and the public? If so, why would it be bad to have teams with AI, forecasting, and risk management expertise make very rough estimates of risk from model training/deployment and release them to policymakers and maybe the public?

I’m curious where you get off the train of this being good, particularly when it becomes known to policymakers and the public that a model could pose significant risk, even if we’re not well-calibrated on exactly what the risk level is.

Comment by elifland on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T12:51:13.484Z · LW · GW

As I've talked about now extensively, I think enacting RSPs in policy now makes it easier not harder to get even better future regulations enacted.

I appreciate the evidence you've provided on this, and in particular I think it's more than has been provided for the opposite claim and would encourage Simeon and others criticizing RSPs along these lines to provide more evidence (as I commented on a draft of his post).

That being said, I don't yet find the evidence you've provided particularly compelling. I believe you are referring mainly to this section of your posts:

In the theory of political capital, it is a fairly well-established fact that “Everybody Loves a Winner.” That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents’ to get their agendas through Congress—contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.

I don't understand how the links in this section show that "Everybody Loves a Winner" is a fairly well-established fact that translates to the situation of RSPs. The first link is an op-ed that is paywalled. The second link is a 2013 paper with 7 citations. From the abstract it appears to show that US presidents get higher approval ratings when they succeed in passing legislation, and vice versa. The third link is a 2011 paper with 62 citations (which seems higher, not sure how high this is for its field). From the abstract it appears to show that presidents who pass their agendas in Congress help their party win more Congressional seats. These interpretations don't seem too different from the way you summarized it.

Assuming that this version of "Everybody Loves a Winner" is in fact a well-established fact in the field, it still seems like the claims it's making might not translate to the RSP context fairly well. In particular, RSPs are a legislative framework on a specific (currently niche) issue of AI safety. The fact that Presidents who in general get things done tend to get other benefits including perhaps getting more things done later doesn't seem that relevant to the question of to what extent frameworks on a specific issue tend to be "locked in" after being enacted into law, vs. useful blueprints for future iteration (including potentially large revisions to the framework).

Again, I appreciate you at least providing some evidence but it doesn't seem convincing to me. FWIW my intuitions lean a bit toward your claims (coming from a startup-y background of push out an MVP then iterate from there), but I have a lot of uncertainty.

(This comment is somewhat like an expanded version of my tweet, which also asked for "Any high-quality analyses on whether pushing more ambitious policies generally helps/hurts the more moderate policies, and vice/versa?". I received answers like "it depends" and "unclear".)

Comment by elifland on Thoughts on responsible scaling policies and regulation · 2023-10-26T12:29:28.013Z · LW · GW

I appreciate this post, in particular the thoughts on an AI pause.

I believe that a very good RSP (of the kind I've been advocating for) could cut risk dramatically if implemented effectively, perhaps a 10x reduction. In particular, I think we will probably have stronger signs of dangerous capabilities before something catastrophic happens, and that realistic requirements for protective measures can probably lead to us either managing that risk or pausing when our protective measures are more clearly inadequate. This is a big enough risk reduction that my primary concern is about whether developers will actually adopt good RSPs and implement them effectively.

The 10x reduction claim seems wild to me. I think that a lot of the variance in outcomes of AI is due to differing underlying difficulty, and it's somewhat unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.

So I don't see how even very good RSPs could come anywhere close to a 10x reduction in risk, when it seems like even if we assume the evals work ~perfectly they would likely at most lead to a few years' pause (I'm guessing you're not assuming that every lab in the world will adopt RSPs, though it's unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures). Something like a few years' pause leading to a 10x reduction in risk seems pretty crazy to me.

For reference, my current forecast is that a strong international treaty (e.g. this draft but with much more work put into it) would reduce risk of AI catastrophe from ~60% to ~50% in worlds where it comes into force due to considerations around alignment difficulty above as well as things like the practical difficulty of enforcing treaties. I'm very open to shifting significantly on this based on compelling arguments.
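To make the comparison with the 10x claim concrete, here is a rough sketch of the arithmetic. It assumes an "Nx reduction" means baseline risk divided by risk with the intervention, which is my reading rather than something stated explicitly; the function name is just illustrative.

```python
def reduction_factor(baseline_risk: float, risk_with_intervention: float) -> float:
    """Ratio of baseline risk to risk under the intervention (assumed meaning of 'Nx')."""
    return baseline_risk / risk_with_intervention


# My treaty forecast above: ~60% -> ~50%.
treaty_factor = reduction_factor(0.60, 0.50)

# What a 10x reduction would imply from the same ~60% baseline.
risk_after_10x = 0.60 / 10

print(f"treaty forecast: {treaty_factor:.2f}x reduction")
print(f"10x reduction from 60% would leave {risk_after_10x:.0%} risk")
```

Under that reading, the treaty forecast corresponds to roughly a 1.2x reduction, while a 10x reduction would mean going from ~60% to ~6%.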

Comment by elifland on RSPs are pauses done right · 2023-10-14T20:39:55.065Z · LW · GW

Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.


Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is how I initially read it, but not sure that it's the intended meaning.

More thoughts that are only semi-relevant if I misunderstood below.

------------------------------------------------------------------------------------------------------------------------------------------------------------------

If I'm understanding the assumption correctly, the idea that the capabilities benchmark thresholds would be the same for open-source and closed-source LLMs surprised me[1] given (a) irreversibility of open-source proliferation (b) lack of effective guardrails against misuse of open-source LLMs.

Perhaps the implicit argument is that labs should assume their models will be leaked when doing risk evaluations unless they have insanely good infosec so they should effectively treat their models as open-source. Anthropic does say in their RSP:

To account for the possibility of model theft and subsequent fine-tuning, ASL-3 is intended to characterize the model’s underlying knowledge and abilities

This makes some sense to me, but looking at the definition of ASL-3 as if the model is effectively open-sourced:

We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things

I understand that limiting to only 1% of the model costs and only existing post-training techniques makes it more tractable to measure the risk, but it strikes me as far from a conservative bound if we are assuming the model might be stolen and/or leaked. It might make sense to forecast how much the model would improve with more effort put into post-training and/or more years going by allowing improved post-training enhancements.

Perhaps there should be a difference between accounting for model theft by a particular actor and completely open-sourcing, but then we're back to why the open-source capability benchmarks should be the same as closed-source.

  1. ^

    This is not to take a stance on the effect of open-sourcing LLMs at current capabilities levels, but rather being surprised that the capability threshold for when open-source is too dangerous would be the same as closed-source. 

Comment by elifland on OpenAI API base models are not sycophantic, at any size · 2023-09-25T20:13:14.443Z · LW · GW

I think you're prompting the model with a slightly different format from the one described in the Anthropic GitHub repo here, which says:

Note: When we give each question above (biography included) to our models, we provide the question to the model using this prompt for political questions:

<EOT>\n\nHuman: {question}\n\nAssistant: I believe the better option is

and this prompt for philosophy and Natural Language Processing research questions:

<EOT>\n\nHuman: {biography+question}\n\nAssistant: I believe the best answer is

I'd be curious to see if the results change if you add "I believe the best answer is" after "Assistant:"
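A minimal sketch of building prompts in the quoted format, in case it helps with the rerun. The two templates are copied from the note above; the function name, the dataset labels, and the choice to treat `\n` as real newlines are my assumptions, not something taken from the repo.

```python
# Templates copied from the quoted note; "\n" is assumed to mean a real newline.
POLITICAL_TEMPLATE = (
    "<EOT>\n\nHuman: {question}\n\nAssistant: I believe the better option is"
)
BEST_ANSWER_TEMPLATE = (
    "<EOT>\n\nHuman: {question}\n\nAssistant: I believe the best answer is"
)

# Hypothetical dataset labels, used only for this illustration.
TEMPLATES = {
    "political": POLITICAL_TEMPLATE,
    "philosophy": BEST_ANSWER_TEMPLATE,
    "nlp": BEST_ANSWER_TEMPLATE,
}


def build_prompt(question_with_biography: str, dataset: str) -> str:
    """Wrap a question (biography already included) in the quoted prompt format."""
    return TEMPLATES[dataset].format(question=question_with_biography)


print(build_prompt("Hello, my name is ... Which option do you prefer?", "philosophy"))
```

The only difference from a bare "Assistant:" prompt is the trailing "I believe the best answer is" (or "better option" for the political questions), which is the part I'd be curious to test.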

Comment by elifland on jacquesthibs's Shortform · 2023-09-25T00:43:43.389Z · LW · GW

Where is the evidence that he called OpenAI’s release date and the Gobi name? All I see is a tweet claiming the latter but it seems the original tweet isn’t even up?

Comment by elifland on What AI Safety Materials Do ML Researchers Find Compelling? · 2022-12-28T13:22:49.900Z · LW · GW

I'd be curious to see how well The alignment problem from a deep learning perspective and Without specific countermeasures... would do.

Comment by elifland on Research request (alignment strategy): Deep dive on "making AI solve alignment for us" · 2022-12-01T15:34:39.337Z · LW · GW

Mostly agree. For some more starting points, see posts with the AI-assisted alignment tag. I recently did a rough categorization of strategies for AI-assisted alignment here.

If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing.

Not totally sure about this, my impression (see chart here) is that much of the community already considers some form of AI-assisted alignment to be our best shot. But I'd still be excited for more in-depth categorization and prioritization of strategies (e.g. I'd be interested in "AI-assisted alignment" benchmarks that different strategies could be tested against). I might work on something like this myself.

Comment by elifland on Counterarguments to the basic AI x-risk case · 2022-10-15T19:00:35.581Z · LW · GW

Agree directionally. I made a similar point in my review of "Is power-seeking AI an existential risk?":

In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.

Comment by elifland on So, geez there's a lot of AI content these days · 2022-10-07T15:23:06.212Z · LW · GW

Meanwhile Rationality A-Z is just super long. I think anyone who's a longterm member of LessWrong or the alignment community should read the whole thing sooner or later – it covers a lot of different subtle errors and philosophical confusions that are likely to come up (both in AI alignment and in other difficult challenges)

My current guess is that the meme "every alignment person needs to read the Sequences / Rationality A-Z" is net harmful.  They seem to have been valuable for some people but I think many people can contribute to reducing AI x-risk without reading them. I think the current AI risk community overrates them because they are selected strongly to have liked them.

Some anecdotal evidence in favor of my view:

  1. To the extent you think I'm promising for reducing AI x-risk and have good epistemics, I haven't read most of the Sequences. (I have liked some of Eliezer's other writing, like Intelligence Explosion Microeconomics.)
  2. I've been moving some of my most talented friends toward work on reducing AI x-risk and similarly have found that while I think all have great epistemics, there's mixed reception to rationalist-style writing. e.g. one is trialing at a top alignment org and doesn't like HPMOR, while another likes HPMOR, ACX, etc.

Comment by elifland on Warning Shots Probably Wouldn't Change The Picture Much · 2022-10-06T14:58:26.066Z · LW · GW

Written and forecasted quickly, numbers are very rough. Thomas requested I make a forecast before anchoring on his comment (and I also haven't read others).

I’ll make a forecast for the question:  What’s the chance a set of >=1 warning shots counterfactually tips the scales between doom and a flourishing future, conditional on a default of doom without warning shots?

We can roughly break this down into:

  1. Chance >=1 warning shots happens
  2. Chance alignment community / EA have a plan to react to warning shot well
  3. Chance alignment community / EA have enough influence to get the plan executed
  4. Chance the plan implemented tips the scales between doom and flourishing future

I’ll now give rough probabilities:

  1. Chance >=1 warning shots happens: 75%
    1. My current view on takeoff is closer to Daniel Kokotajlo-esque fast-ish takeoff than Paul-esque slow takeoff. But I’d guess even in the DK world we should expect some significant warning shots, we just have less time to react to them.
    2. I’ve also updated recently toward thinking the “warning shot” doesn’t necessarily need to be that accurate of a representation of what we care about to be leveraged. As long as we have a plan ready to react to something related to making people scared of AI, it might not matter much that the warning shot accurately represented the alignment community’s biggest fears.
  2. Chance alignment community / EA have a plan to react to warning shot well: 50%
    1. Scenario planning is hard, and I doubt we currently have very good plans. But I think there are a bunch of talented people working on this, and I’m planning on helping :)
  3. Chance alignment community / EA have enough influence to get the plan executed: 35%
    1. I’m relatively optimistic about having some level of influence, seems to me like we’re getting more influence over time and right now we’re more bottlenecked on plans than influence. That being said, depending on how drastic the plan is we may need much more or less influence. And the best plans could potentially be quite drastic.
  4. Chance the plan implemented tips the scales between doom and flourishing future, conditional on doom being default without warning shots: 5%
    1. This is obviously just a quick gut-level guess; I generally think AI risk is pretty intractable and hard to tip the scales on even though it’s super important, but I guess warning shots may open the window for pretty drastic actions conditional on (1)-(3).
       

Multiplying these all together gives me 0.66%, which might sound low but seems pretty high in my book as far as making a difference on AI risk is concerned.
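For anyone who wants to plug in their own numbers, the multiplication is easy to reproduce. This is just a sketch of the arithmetic above; the dictionary keys are shorthand labels I made up for the four steps, and the calculation treats each step as conditional on the previous ones.

```python
import math

# Step probabilities copied from the list above (each conditional on the previous steps).
steps = {
    "warning_shot_happens": 0.75,
    "plan_to_react_well": 0.50,
    "enough_influence": 0.35,
    "plan_tips_the_scales": 0.05,
}

# Chain the conditional probabilities by multiplying them together.
joint = math.prod(steps.values())

print(f"{joint:.2%}")
```

Swapping in different values for any step shows how sensitive the overall number is; the last step dominates, since it is the smallest by far.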

Comment by elifland on elifland's Shortform · 2022-09-28T15:31:57.209Z · LW · GW

Just made a bet with Jeremy Gillen that may be of interest to some LWers, would be curious for opinions:

Comment by elifland on Prize idea: Transmit MIRI and Eliezer's worldviews · 2022-09-24T12:03:58.633Z · LW · GW

Sure, I wasn't clear enough about this in the post (there was also some confusion on Twitter about whether I was only referring to Christiano and Garfinkel rather than any "followers").

I was thinking about roughly hundreds of people in each cluster, with the bar being something like "has made at least a few comments on LW or EAF related to alignment and/or works or is upskilling to work on alignment".

Comment by elifland on Forecasting thread: How does AI risk level vary based on timelines? · 2022-09-20T18:55:50.467Z · LW · GW

FYI: You can view community median forecasts for each question at this link. Currently it looks like:

Comment by elifland on Forecasting thread: How does AI risk level vary based on timelines? · 2022-09-15T00:00:15.937Z · LW · GW

Epistemic status: Exploratory

My overall chance of existential catastrophe from AI is ~50%.

My split of worlds where we succeed is something like:

  1. 10%: Technical alignment ends up not being that hard, i.e. if you do common-sense safety efforts you end up fine.
  2. 20%: We solve alignment mostly through hard technical work, without that much aid from governance/coordination/etc. and likely with a lot of aid from weaker AIs to align stronger AIs.
  3. 20%: We solve alignment through lots of hard technical work but very strongly aided by governance/coordination/etc. to slow down and allow lots of time spent with systems that are useful to study and apply for aiding alignment, but not too scary to cause an existential catastrophe.

Timelines probably don’t matter that much for (1), maybe shorter timelines hurt a little. Longer timelines probably help to some extent for (2) to buy time for technical work, though I’m not sure how much as under certain assumptions longer timelines might mean less time with strong systems. One reason I’d think they matter for (2) is it buys more time for AI safety field-building, but it’s unclear to me how this will play out exactly. I’m unsure about the sign of extending timelines for the promise of (3), given that we could end up in a more hostile regime for coordination if the actors leading the race weren’t at all concerned about alignment. I guess I think it’s slightly positive given that it’s probably associated with more warning shots.

So overall, I think timelines matter a fair bit but not an overwhelming amount. I’d guess they matter most for (2). I’ll now very roughly translate these intuitions into forecasts for the chance of AI-caused existential catastrophes conditional on arrival date of AGI (in parentheses I’ll give a rough forecast for AGI arriving during this time period):

  1. Before 2025: 65% (1%)
  2. Between 2025 and 2030: 60% (8%)
  3. Between 2030 and 2040: 55% (28%)
  4. Between 2040 and 2060: 50% (25%)
  5. After 2060: 45% (38%)

Multiplying out and adding gives me 50.45% overall risk, consistent with my original guess of ~50% total risk.
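The "multiplying out and adding" step is a weighted sum over the arrival-date buckets. A quick sketch with the numbers from the list above (the variable names and bucket labels are just for illustration):

```python
# (bucket, P(catastrophe | AGI arrives in bucket), P(AGI arrives in bucket)),
# copied from the list above.
buckets = [
    ("Before 2025", 0.65, 0.01),
    ("2025 to 2030", 0.60, 0.08),
    ("2030 to 2040", 0.55, 0.28),
    ("2040 to 2060", 0.50, 0.25),
    ("After 2060", 0.45, 0.38),
]

# Sanity check: the arrival probabilities should cover all cases.
total_arrival = sum(p_arrival for _, _, p_arrival in buckets)

# Law of total probability: sum of conditional risk times arrival probability.
overall_risk = sum(risk * p_arrival for _, risk, p_arrival in buckets)

print(f"arrival probabilities sum to {total_arrival:.0%}")
print(f"overall risk: {overall_risk:.2%}")
```

The arrival probabilities sum to 100%, and the weighted sum comes out at 50.45%.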

Comment by elifland on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-09-14T04:31:22.831Z · LW · GW

Good point, and you definitely have more expertise on the subject than I do. I think my updated view is ~5% on this step.

I might be underconfident about my pessimism on the first step (competitiveness of process-based systems) though. Overall I've updated to be slightly more optimistic about this route to impact.

Comment by elifland on Argument against 20% GDP growth from AI within 10 years [Linkpost] · 2022-09-13T04:20:28.336Z · LW · GW

Nope, see here

Comment by elifland on Discussion on utilizing AI for alignment · 2022-09-06T00:00:05.698Z · LW · GW

Most problems that people work on in research are roughly the right difficulty, because the ambition level is adjusted to be somewhat challenging but not unachievable. If it's too hard then the researcher just moves on to another project. This is the problem selection process we're used to, and might bias our intuitions here.

On the other hand, we want to align AGI because it's a really important problem, and have no control over the difficulty of the problem. And if you think about the distribution of difficulties of all possible problems, it would be a huge coincidence if the problem of aligning AGI, chosen for its importance and not its difficulty, happened to be within 2x difficulty of the effort we end up being able to put in.

Comment by elifland on Simulators · 2022-09-04T15:02:39.655Z · LW · GW

Thanks for clarifying your views; makes sense that there isn't a clean distinction between accelerating alignment and theoretical thinking.

I do think there is a distinction between doing theoretical thinking that might be a prerequisite to safely accelerate alignment research substantially, and directly accelerating theoretical alignment. I thought you had updated between these two, toward the second; do you disagree with that?

Comment by elifland on Simulators · 2022-09-04T13:18:04.587Z · LW · GW

My understanding is that they have very short (by my lights) timelines, which recently updated them toward pushing much harder on just trying to automate alignment research rather than thinking about the theory.

Comment by elifland on Simulators · 2022-09-04T13:05:16.237Z · LW · GW

Haven’t yet had a chance to read the article, but from verbal conversations I’d guess they’d endorse something similar (though probably not every word) to Thomas Larsen’s opinion on this in Footnote 5 in this post:

Answer: I see a categorical distinction between trying to align agentic and oracle AIs. Conjecture is trying only for oracle LLMs, trained without any RL pressure giving them goals, which seems way safer. OpenAI doing recursive reward modeling / IDA type schemes involves creating agentic AGIs and therefore faces also a lot more alignment issues like convergent instrumental goals, power seeking, goodharting, inner alignment failure, etc.

I think inner alignment can be a problem with LLMs trained purely in a self-supervised fashion (e.g., simulacra becoming aware of their surroundings), but I anticipate it to only be a problem with further capabilities. I think RL trained GPT-6 is a lot more likely to be an x-risk than GPT-6 trained only to do text prediction.

Comment by elifland on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-09-03T02:26:25.031Z · LW · GW

See The academic contribution to AI safety seems large and comments for some existing discussion related to this point