Research directions Open Phil wants to fund in technical AI safety

post by jake_mendel, maxnadeau, Peter Favaloro (peter.favaloro@gmail.com) · 2025-02-08T01:40:00.968Z · LW · GW · 21 comments

This is a link post for https://www.openphilanthropy.org/tais-rfp-research-areas/

Contents

  Synopsis
    Adversarial machine learning 
    Exploring sophisticated misbehavior in LLMs
    Model transparency
    Trust from first principles
    Alternative approaches to mitigating AI risks
  Research Areas
    *Jailbreaks and unintentional misalignment
    *Control evaluations
    *Backdoors and other alignment stress tests
    *Alternatives to adversarial training
    Robust unlearning
    *Experiments on alignment faking
    *Encoded reasoning in CoT and inter-model communication
    Black-box LLM “psychology”
    Evaluating whether models can hide dangerous behaviors
    Reward hacking of human oversight
    *Applications of white-box techniques
    Activation monitoring
    Finding feature representations
    Toy models for interpretability
    Externalizing reasoning
    Interpretability benchmarks
    †More transparent architectures
    White-box estimation of rare misbehavior
    Theoretical study of inductive biases
    †Conceptual clarity about risks from powerful AI
    †New moonshots for aligning superintelligence

Open Philanthropy has just launched [LW · GW] a large new Request for Proposals for technical AI safety research. Here we're sharing a reference guide, created as part of that RFP, which describes what projects we'd like to see across 21 research directions in technical AI safety.

This guide provides an opinionated overview of recent work and open problems across areas like adversarial testing, model transparency, and theoretical approaches to AI alignment. We link to hundreds of papers and blog posts and offer approximately a hundred different example projects. We hope this is a useful resource for technical people getting started in alignment research. We'd also welcome feedback from the LW community on our prioritization within or across research areas.

For each research area, we include:

Applications (here) start with a simple 300-word expression of interest and are open until April 15, 2025. We plan to fund $40M in grants and have funding available for substantially more, depending on application quality.

Synopsis

In this section we briefly orient readers to the 21 research areas that we’ll discuss in more detail below. For ease of consumption, we’ve grouped them into 5 rough clusters, though of course there is overlap and ambiguity in how to categorize each research area.

Our favorite topics are marked with a star (*) – we’re especially eager to fund work in these areas. In contrast, we will have a high bar for topics marked with a dagger (†).

Adversarial machine learning 

This cluster of research areas uses simulated red-team/blue-team exercises to expose the vulnerabilities of an LLM (or a system that incorporates LLMs). Across these directions, a blue team attempts to make an AI system adhere with very high reliability to some specification of safe behavior, and then a red team attempts to find edge cases that violate that specification. We think this adversarial style of evaluation and iteration is necessary to ensure an AI system has a low probability of catastrophic failure. Through these research directions, we aim to develop robust safety techniques that mitigate risks from AIs before those risks emerge in real-world deployments.
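
As a toy illustration of this red-team/blue-team structure, a minimal evaluation loop might look like the sketch below (the function names are hypothetical placeholders, not tools from the RFP):

```python
# Toy sketch of a red-team/blue-team evaluation loop (illustrative only).
# `blue_team_model`, `red_team_attack`, and `violates_spec` are hypothetical
# placeholders; the point is the shape of the iteration, not any real API.

from dataclasses import dataclass, field


@dataclass
class EvalReport:
    attempts: int = 0
    violations: list = field(default_factory=list)

    @property
    def violation_rate(self) -> float:
        return len(self.violations) / self.attempts if self.attempts else 0.0


def run_adversarial_eval(blue_team_model, red_team_attack, violates_spec, n_rounds=1000):
    """The red team searches for inputs that make the blue team's system violate
    its safety specification; the blue team iterates until the measured
    violation rate is acceptably low."""
    report = EvalReport()
    for _ in range(n_rounds):
        adversarial_input = red_team_attack(blue_team_model)  # e.g. a candidate jailbreak
        output = blue_team_model(adversarial_input)
        report.attempts += 1
        if violates_spec(adversarial_input, output):          # edge case found
            report.violations.append((adversarial_input, output))
    return report
```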

Exploring sophisticated misbehavior in LLMs

Future, more capable AI models might exhibit novel failure modes that are hard to detect with current methods – for instance, failure modes that involve LLMs reasoning about their human developers or becoming optimized to deceive flawed human assessors. We want to fund research that identifies the conditions under which these failure modes occur, and makes progress toward robust methods of mitigating or avoiding them. 

Model transparency

We see potential in the idea of using a network’s intermediate representations to predict, monitor, or modify its behavior. Some approaches are feasible without an understanding of the model’s learned mechanisms, while other techniques may become possible with the invention of interpretability methods that more comprehensively decompose an AI’s internal mechanisms into components that can be understood and intervened on individually. We’re interested in funding research across this spectrum — everything from useful kludges to new ideas for making models more transparent and steerable. 
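
As one concrete example from the "useful kludge" end of this spectrum, a linear probe can be trained on a model's intermediate activations to flag behaviors of interest. Here is a minimal sketch, with random stand-ins for the activations and labels rather than any real dataset or prescribed method:

```python
# Minimal sketch of activation monitoring with a linear probe (illustrative only).
# In practice the activations would come from a chosen layer of an LLM, on prompts
# labeled safe (0) vs. unsafe (1); here they are random stand-ins.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 4096))  # (n_examples, hidden_dim)
labels = rng.integers(0, 2, size=2000)       # placeholder labels

probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:1500], labels[:1500])

# At deployment time, the probe scores each new activation; high scores could
# trigger extra review or refusal before the model's output reaches a user.
risk_scores = probe.predict_proba(activations[1500:])[:, 1]
flagged = risk_scores > 0.9
print(f"flagged {flagged.sum()} of {len(flagged)} examples")
```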

Trust from first principles

We trust nuclear power plants and orbital rockets through validated theories that are principled and mechanistic, rather than through direct trial-and-error. We would benefit from similarly systematic, principled approaches to understanding and predicting AI behavior. One approach to this is model transparency, as in the previous cluster. But understanding may not be a necessary condition: this cluster aims to get the safety and trust benefits of interpretability without humans having to understand any specific AI model in all its details. 

Alternative approaches to mitigating AI risks

These research areas lie outside the scope of the clusters above.

Research Areas

*Jailbreaks and unintentional misalignment

*Control evaluations

*Backdoors and other alignment stress tests

*Alternatives to adversarial training

Robust unlearning

*Experiments on alignment faking

*Encoded reasoning in CoT and inter-model communication

Black-box LLM “psychology”

Evaluating whether models can hide dangerous behaviors

Reward hacking of human oversight

*Applications of white-box techniques

Activation monitoring

Finding feature representations

Toy models for interpretability

Externalizing reasoning

Interpretability benchmarks

†More transparent architectures

White-box estimation of rare misbehavior

Theoretical study of inductive biases

†Conceptual clarity about risks from powerful AI

†New moonshots for aligning superintelligence

21 comments

Comments sorted by top scores.

comment by Rohin Shah (rohinmshah) · 2025-02-08T09:56:28.531Z · LW(p) · GW(p)

I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.

I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimization pressure against it, but that isn't described as a core desideratum, and I don't expect we get it. Work on externalized reasoning can also make alignment easier, but I'm not counting that as "directly relevant".) Everything else seems to mostly focus on evaluation or fundamental science that we hope pays off in the future. (To be clear, I think those are good and we should clearly put a lot of effort into them, just not all of our effort.)

Areas that I'm more excited about relative to the median area in this post (including some of your starred areas):

  • Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
  • Mild optimization. I'm particularly interested in MONA [LW · GW] (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
  • Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)

To be fair, a lot of this work is much harder to execute on, requiring near-frontier models, significant amounts of compute, and infrastructure to run RL on LLMs, and so is much better suited to frontier labs than to academia / nonprofits, and you will get fewer papers per $ invested in it. Nonetheless, academia / nonprofits are so much bigger than frontier labs that I think academics should still be taking shots at these problems.

(And tbc there are plenty of other areas directly relevant to alignment that I'm less excited about, e.g. improved variants of RLHF / Constitutional AI, leveraging other modalities of feedback for LLMs (see Table 1 here), and "gradient descent psychology" (empirically studying how fine-tuning techniques affect LLM behavior).)

Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., McAleese et al.), AI debate (Arnesen et al., Khan et al.), other scalable oversight methods

Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)

(A lot of the scalable oversight work from the last year is inference-only, but at least for me (and for the field historically) the goal is to scale this to training the AI system -- all of the theory depends on equilibrium behavior which you only get via training.)
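
A toy sketch of this distinction, with every object below a hypothetical placeholder rather than any real evaluation harness:

```python
# Toy contrast between a control evaluation and an alignment stress test.
# `adversarial_model`, `trusted_monitor`, `training_procedure`, and `tasks`
# are all hypothetical placeholders.

def control_evaluation(adversarial_model, trusted_monitor, tasks):
    """Swap a known-adversarial model in for the untrusted model at inference
    time, and measure whether the surrounding protocol (monitoring, auditing,
    etc.) still prevents unacceptable outcomes."""
    failures = 0
    for task in tasks:
        action = adversarial_model.act(task)          # red team plays the untrusted model
        if not trusted_monitor.blocks(task, action):  # the protocol's safety net
            failures += int(task.is_catastrophic(action))
    return failures / len(tasks)


def alignment_stress_test(training_procedure, adversarial_model, tasks):
    """Start from an adversarial model and check whether a training procedure
    (e.g. a scalable-oversight scheme) actually removes the misaligned
    behavior, rather than merely containing it at inference time."""
    trained_model = training_procedure.train(adversarial_model)
    bad = sum(int(task.is_catastrophic(trained_model.act(task))) for task in tasks)
    return bad / len(tasks)
```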

Replies from: Buck, Fabien, Buck, oliver-daniels-koch
comment by Buck · 2025-02-08T16:46:37.549Z · LW(p) · GW(p)

I think we should just all give up on the word "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-02-08T19:04:46.856Z · LW(p) · GW(p)

The idea I associate with scalable oversight is weaker models overseeing stronger models, (probably) combined with safety-by-debate. Is that the same as or different from "recursive techniques for reward generation"?

Currently, this general class of ideas seems to me the most promising avenue for achieving alignment for vastly superhuman AI ("superintelligence").

comment by Fabien Roger (Fabien) · 2025-02-08T19:10:02.384Z · LW(p) · GW(p)
  • Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
  • Mild optimization. I'm particularly interested in MONA [LW · GW] (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.

I think additional non-moonshot work in these domains will have a very hard time helping.

[low confidence] My high-level concern is that non-moonshot work in these clusters may be the sort of thing labs will use anyway (with or without a safety push) if it helps with capabilities, because the techniques are easy to find, and won't use if it doesn't help with capabilities, because the case for risk reduction is weak. This concern is mostly informed by my (relatively shallow) read of recent work in these clusters.

[edit: I was at least somewhat wrong, see comment threads below]

Here are things that would change my mind:

  • If I thought people were making progress towards techniques with nicer safety properties and no alignment tax that seem hard enough to make workable in practice that capabilities researchers won't bother using them by default, but would use them if there were existing work on how to make them work.
    • (For the different question of preventing AIs from using "harmful knowledge", I think work on robust unlearning and gradient routing may have this property - the current SoTA is far enough from a solution that I expect labs to not bother doing anything good enough here, but I think there is a path to legible success, and conditional on success I expect labs to pick it up because it would be obviously better, more robust, and plausibly cheaper than refusal training + monitoring. And I think robust unlearning and gradient routing have better safety properties than refusal training + monitoring.)
  • If I thought people were making progress towards understanding when not using process-based supervision and debate is risky. This looks like demos and model organisms aimed at measuring when, in real life, not using these simple precautions would result in very bad outcomes while using the simple precautions would help.
    • (In the different domain of CoT faithfulness, I think there is a lot of value in demonstrating the risk of opaque CoT well enough that labs don't build techniques that make CoT more opaque if doing so just slightly increases performance, because I expect that it will be easier to justify. I think GDM's updated safety framework is a good step in this direction as it hints at additional requirements GDM may have to fulfill if it wanted to deploy models with opaque CoT past a certain level of capabilities.)
  • If I thought that research directions included in the cluster you are pointing at were making progress towards speeding up capabilities in safety-critical domains (e.g. conceptual thinking on alignment, being trusted neutral advisors on geopolitical matters, ...) relative to baseline methods (i.e. the sort of RL you would do by default if you wanted to make the model better at the safety-critical task and had no awareness of anything people did in the safety literature).

I am not very aware of what is going on in this part of the AI safety field. It might be the case that I would change my mind if I was aware of certain existing pieces of work or certain arguments. In particular I might be too skeptical about progress on methods for things like debate and process-based supervision - I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?

It's also possible that I am missing an important theory of change for this sort of research.

(I'd still like it if more people worked on alignment: I am excited about projects that look more like the moonshots described in the RFP and less like the kind of research I think you are pointing at.)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2025-02-08T22:35:56.917Z · LW(p) · GW(p)

I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?

I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignment is easy to achieve with RLHF and its variants, or RL on ground truth signals. I guess in some sense the question is "how big is the alignment tax", and I agree we don't know the answer to that yet and may not have enough understanding by the time it is relevant, but I don't really see why one would think "nah it'll only work in toy domains".

I agree debate doesn't work yet, though I think there's a >50% chance we demonstrate decent results in some LLM domain (possibly a "toy" one) by the end of this year. Currently it seems to me like a key bottleneck (possibly the only one) is model capability, similarly to how model capability was a bottleneck to achieving the value of RL on ground truth until ~2024.

It also seems like the methods would still be useful even if they were only adopted some time after the point when labs first want to use them for production runs.

It's wild to me that you're into moonshots when your objection to existing proposals is roughly "there isn't enough time for research to make them useful". Are you expecting the moonshots to be useful immediately?

Replies from: Fabien, Fabien
comment by Fabien Roger (Fabien) · 2025-02-09T14:14:52.859Z · LW(p) · GW(p)

Which of the bullet points in my original message do you think is wrong? Do you think MONA and debate papers are:

  • on the path to techniques that measurably improve feedback quality on real domains with potentially a low alignment tax, and that are hard enough to find that labs won't use them by default?
  • on the path to providing enough evidence of their good properties that even if they did not measurably help with feedback quality in real domains (and slightly increased cost), labs could be convinced to use them because they are expected to improve non-measurable feedback quality?
  • on the path to speeding up safety-critical domains?
  • (valuable for some other reason?)
Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2025-02-09T18:25:43.251Z · LW(p) · GW(p)

I have some credence in all three of those bullet points.

For MONA it's a relatively even mixture of the first and second points.

(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)

For debate it's mostly the first point, and to some extent the third point.

Replies from: Fabien, ryan_greenblatt
comment by Fabien Roger (Fabien) · 2025-02-10T12:34:11.617Z · LW(p) · GW(p)

Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality

That's right. If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point (especially if the alignment tax is non-negligible), and I am not sure how you get it.

In particular, for the median AIs that labs use to 20x AI safety research, my guess is that you won't have invisible long-term reward hacking problems, and so I would advise labs to spend the alignment tax on other interventions (like using weaker models when possible, or doing control), not on using process-based rewards. I would give different advice if

  • the alignment tax of MONA were tiny
  • there were decent evidence for invisible long-term reward hacking problems with catastrophic consequences solved by MONA

I think this is not super plausible to happen, but I am sympathetic to research towards these two goals. So maybe we don't disagree that much (except maybe on the plausibility of invisible long-term reward hacking problems for the AIs that matter the most).

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2025-02-11T23:34:38.277Z · LW(p) · GW(p)

If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point

That doesn't seem right. It can simultaneously be the case that you can't tell that there are problems stemming from long-term optimization problems when you don't use MONA, and also if you actually use MONA, then it will measurably improve quality.

For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we'd penalize if we knew about it, but we don't realize that's happening). Later, when things are put into production, errors happen, but it's chalked up to "well it's hard to anticipate everything".

Instead you use MONA, and it doesn't learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent's lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.
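
To make the contrast concrete, here is a toy sketch of the two reward structures; the helper functions and trajectory format are invented for illustration and are not the MONA paper's implementation:

```python
# Toy sketch: outcome-based RL credit vs. MONA-style per-step approval.
# `final_outcome_score` and `overseer_approval` are illustrative placeholders.

def outcome_rewards(trajectory, final_outcome_score):
    # Ordinary outcome RL: every step is credited according to the final outcome,
    # so a step that silently skips error-checking can be reinforced if it
    # improves the end metric (e.g. lower latency).
    return [final_outcome_score(trajectory)] * len(trajectory)

def mona_rewards(trajectory, overseer_approval):
    # MONA-style: each step is rewarded by the overseer's (possibly nonmyopic)
    # approval of that step in context, with no credit flowing back from later
    # outcomes, so "skip error handling" only pays off if the overseer approves.
    return [overseer_approval(trajectory[: t + 1]) for t in range(len(trajectory))]

# Dummy trajectory of coding "steps":
steps = ["write function", "skip error handling to cut latency", "return result"]
print(outcome_rewards(steps, lambda traj: 1.0))  # [1.0, 1.0, 1.0]
print(mona_rewards(steps, lambda prefix: 0.0 if "skip error" in prefix[-1] else 1.0))  # [1.0, 0.0, 1.0]
```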

Replies from: Fabien
comment by Fabien Roger (Fabien) · 2025-02-14T10:38:10.142Z · LW(p) · GW(p)

To rephrase what I think you are saying about the situations where work on MONA is very helpful:

  • By default people get bitten by long-term optimization. They notice issues in prod because it's hard to catch everything. They patch individual failures when they come up, but don't notice that if they did more work on MONA, they would stop the underlying driver of many issues (including future issues that could result in catastrophes). They don't try MONA-like techniques because doing so is not very salient when you are trying to fix individual failures and does not pass cost-benefit as a fix for any individual failure.
  • If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches and that the alignment tax is not too large. This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly and because people don't by default think of the individual failures you've seen in prod as symptoms of their long-term optimization, but your empirical work pushes them over the line and they end up trying to adopt MONA to avoid future failures in prod (and maybe reduce catastrophic risk - though given competitive pressures, that might not be the main factor driving decisions and so you don't have to make an ironclad case for MONA reducing catastrophic risk).

Is that right?

I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2025-02-14T10:54:00.570Z · LW(p) · GW(p)

Is that right?

Yes, that's broadly accurate, though one clarification:

This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly

That's a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.

I think this will become much more likely once we actually start observing long-term optimization failures in prod.

Agreed, we're not advocating for using MONA now (and say so in the paper).

Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?

Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)

comment by ryan_greenblatt · 2025-02-09T19:17:00.436Z · LW(p) · GW(p)

You are possibly the first person I know of who reacted to MONA with "that's obvious"

I also have the "that's obvious" reaction, but possibly I'm missing some details. I also think it won't perform well enough in practice to pencil, given other better places to allocate safety budget (if it does trade off, which is unclear).

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2025-02-10T05:22:36.115Z · LW(p) · GW(p)

I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").

Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).

comment by Fabien Roger (Fabien) · 2025-02-09T13:52:07.270Z · LW(p) · GW(p)

I don't really see why one would think "nah it'll only work in toy domains".

That is not my claim. By "I would have guessed that methods work on toy domains and math will be useless" I meant "I would have guessed that if a lab decided to do process-based feedback, it will be better off not doing a detailed literature review of methods in MONA and followups on toy domains, and just do the process-based supervision that makes sense in the real domain they now look at. The only part of the method section of MONA papers that matters might be "we did process-based supervision"."

I did not say "methods that work on toy domains will be useless" (my sentence was easy to misread).

I almost have the opposite stance: I am closer to "it's so obvious that process-based feedback helps that if capabilities people ever had issues stemming from long-term optimization, they would obviously use more myopic objectives. So process-based feedback so obviously prevents problems from non-myopia in real life that the experiments in the MONA paper don't increase the probability that people working on capabilities implement myopic objectives."

But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings (above the "capability researcher does what looks easiest and sensible to them" baseline)?

I agree debate doesn't work yet

Not a crux. My guess is that if debate did "work" to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality. So my low confidence guess is that it's not high priority for people working on x-risk safety to speed up that work.

But I am excited about debate work that is not just about improving feedback quality. For example I am interested in debate vs schemers or debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn't show up in "feedback quality benchmarks (e.g. rewardbench)" but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2025-02-09T18:23:40.278Z · LW(p) · GW(p)

Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)

But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings

I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generalizable lessons to be learned.

I agree that there's not much benefit in "methods work" if that is understood as "work on the algorithm / code that given data + rewards / approvals translates it into gradient updates". I care a lot more about iterating on how to produce the data + rewards / approvals.

My guess is that if debate did "work" to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality.

I'd weakly bet against this, I think there will be lots of fiddly design decisions that you need to get right to actually see the benefits, plus iterating on this is expensive and hard because it involves multiagent RL. (Certainly this is true of our current efforts; the question is just whether this will remain true in the future.)

For example I am interested in [...] debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn't show up in "feedback quality benchmarks (e.g. rewardbench)" but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.

I'm confused. This seems like the central example of work I'm talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you're talking about as well.)

EDIT: And tbc this is the kind of thing I mean by "improving average-case feedback quality". I now feel like I don't know what you mean by "feedback quality".

Replies from: Fabien
comment by Fabien Roger (Fabien) · 2025-02-10T12:21:31.937Z · LW(p) · GW(p)

I'm confused. This seems like the central example of work I'm talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you're talking about as well.)

My bad, I was a bit sloppy here. The debate-for-control stuff is in the RFP but not the debate vs subtle reward hacks that don't show up in feedback quality evals.

I think we agree that there are some flavors of debate work that are exciting and not present in the RFP.

comment by Buck · 2025-02-08T16:47:59.462Z · LW(p) · GW(p)
  • Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)

I don't know what this means, do you have any examples?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2025-02-08T22:11:40.545Z · LW(p) · GW(p)

I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.

comment by Oliver Daniels (oliver-daniels-koch) · 2025-02-09T21:13:40.376Z · LW(p) · GW(p)

Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)

I have a similar confusion (see my comment here [LW(p) · GW(p)]), but it seems like at least Ryan wants control evaluations to cover this case? (perhaps on the assumption that if your "control measures" are successful, they should be able to elicit aligned behavior from scheming models and this behavior can be reinforced?)

comment by Towards_Keeperhood (Simon Skade) · 2025-02-12T18:59:40.559Z · LW(p) · GW(p)

Applications (here) start with a simple 300-word expression of interest and are open until April 15, 2025. We plan to fund $40M in grants and have funding available for substantially more, depending on application quality.

Did you consider instead committing to giving out retroactive funding for research progress that seems useful?

I.e., people could apply for funding for anything done from 2025 onward, and then you could actually better evaluate how useful some research was, rather than needing to guess in advance how useful a project might be. And in a way where quite impactful results can be paid a lot, so you don't disincentivize low-chance-high-reward strategies. And so we get impact market dynamics where investors can fund projects in exchange for a share of the retroactive funding in case of success.

There are difficulties of course. Intuitively this retroactive approach seems a bit more appealing to me, but I'm basically just asking whether you considered it and if so why you didn't go with it.

comment by Towards_Keeperhood (Simon Skade) · 2025-02-12T18:50:26.684Z · LW(p) · GW(p)

Applications (here) start with a simple 300-word expression of interest and are open until April 15, 2025. We plan to fund $40M in grants and have funding available for substantially more, depending on application quality.

Side question: How much is Openphil funding LTFF? (And why not more?)

(I recently got an email from LTFF which suggested that they are quite funding constrained. And I'd intuitively expect LTFF to be higher impact per dollar than this, though I don't really know.)