Open Phil releases RFPs on LLM Benchmarks and Forecasting

post by LawrenceC (LawChan) · 2023-11-11T03:01:09.526Z

This is a link post for https://www.openphilanthropy.org/research/rfps-on-llm-impacts/

As linked at the top of Ajeya's "do our RFPs accelerate LLM capabilities" post, Open Philanthropy (OP) recently released two requests for proposals (RFPs):

An RFP on LLM agent benchmarks: how do we accurately measure the real-world, impactful capabilities of LLM agents?
An RFP on forecasting the real-world impacts of LLMs: how can we understand and predict the broader real-world impacts of LLMs?

Note that the first RFP is both significantly more detailed and narrower in scope than the second, and OP recommends applying to the LLM benchmark RFP if your project may be a fit for both.

Brief details for each RFP are below, though please read the RFPs yourself if you plan to apply.

Benchmarking LLM agents on consequential real-world tasks

Link to RFP: https://www.openphilanthropy.org/rfp-llm-benchmarks

We want to fund benchmarks that allow researchers starting from very different places to come to much greater agreement about whether extreme capabilities and risks are plausible in the near-term. If LLM agents score highly on these benchmarks, a skeptical expert should hopefully become much more open to the possibility that they could soon automate large swathes of important professions and/or pose catastrophic risks. And conversely, if they score poorly, an expert who is highly concerned about imminent catastrophic risk should hopefully reduce their level of concern for the time being.

In particular, they're looking for benchmarks with the following three desiderata:

Construct validity: the benchmark accurately captures a potential real-world, impactful capability of LLM agents.
Consequential tasks: the benchmark features tasks that will have massive economic impact or can pose massive risks.
Continuous scale: scores on the benchmark improve relatively smoothly as LLM agents improve (that is, they don't jump from ~0% to >90% performance, as scores on many existing LLM benchmarks have).
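
To make the "continuous scale" desideratum concrete, here is a minimal sketch of one way to check it. This is my own illustration, not something from the RFP: the score lists are made-up numbers, and the helper names and the 50-point threshold are arbitrary assumptions.

```python
def largest_jump(scores):
    """Largest increase (in percentage points) between consecutive scores."""
    return max(later - earlier for earlier, later in zip(scores, scores[1:]))


def is_roughly_continuous(scores, max_jump=50.0):
    """True if no single step between model generations exceeds max_jump.

    max_jump is an arbitrary illustrative threshold, not a value from the RFP.
    """
    return largest_jump(scores) <= max_jump


# Hypothetical scores (percent) for four successive model generations.
# These numbers are invented for illustration and are not real results.
saturating_benchmark = [1.0, 3.0, 92.0, 97.0]  # jumps from ~0% to >90%
graded_benchmark = [5.0, 22.0, 41.0, 63.0]     # improves step by step

print(is_roughly_continuous(saturating_benchmark))  # False
print(is_roughly_continuous(graded_benchmark))      # True
```

A benchmark like the first gives little warning before it saturates, while one like the second keeps providing signal as agents improve.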

OP will also host a virtual Q&A session for this RFP:

We will also be hosting a 90-minute webinar to answer questions about this RFP on Wednesday, November 29 at 10 AM Pacific / 1 PM Eastern (link to come).

Studying and forecasting the real-world impacts of systems built from LLMs

Link to RFP: https://www.openphilanthropy.org/rfp-llm-impacts/

This RFP is significantly less detailed, and primarily consists of a list of projects that OP may be willing to fund:

To this end, in addition to our request for proposals to create benchmarks for LLM agents, we are also seeking proposals for a wide variety of research projects which might shed light on what real-world impacts LLM systems could have over the next few years.

Here's the full list of projects they think could make a strong proposal:

Conducting randomized controlled trials to measure the extent to which access to LLM products can increase human productivity on real-world tasks. For example:
Polling members of the public about whether and how much they use LLM products, what tasks they use them for, and how useful they find them to be.
In-depth interviews with people working on deploying LLM agents in the real world.
Collecting “in the wild” case studies of LLM use, for example by scraping Reddit (e.g. r/chatGPT), asking people to submit case studies to a dedicated database, or even partnering with a company to systematically collect examples from consenting customers.
Estimating and collecting key numbers into one convenient place to support analysis.
Creating interactive experiences that allow people to directly make and test their guesses about what LLMs can do.
Eliciting expert forecasts about what LLM systems are likely to be able to do in the near future and what risks they might pose.
Synthesizing, summarizing, and analyzing the various existing lines of evidence about what language model systems can and can’t do at present (including benchmark evaluations, deployed commercial uses, and qualitative case studies, etc) and what they might be able to do soon to arrive at an overall judgment about what LLM systems are likely to be able to do in the near term.

There's no Q&A session for this RFP.
