meemi's Shortform

post by meemi (meeri) · 2025-01-18T13:19:43.908Z · LW · GW · 10 comments



comment by meemi (meeri) · 2025-01-18T13:19:43.998Z · LW(p) · GW(p)

FrontierMath was funded by OpenAI.[1]

The communication about this has been non-transparent, and many people, including contractors working on the dataset, were not aware of this connection.

Before Dec 20th (the day OpenAI announced o3), there was no public communication about OpenAI funding this benchmark. The previous arXiv versions (v1-v4) do not acknowledge OpenAI for their support. The support was first made public on Dec 20th.[1]

Because the arXiv version mentioning OpenAI's contribution came out right after the o3 announcement, I'd guess Epoch AI had an agreement with OpenAI not to mention it publicly until then.

The mathematicians creating the problems for FrontierMath were not (actively)[2] informed about the funding from OpenAI. The contractors were instructed to keep the exercises and their solutions secure, including not using Overleaf or Colab or emailing about the problems, and to sign NDAs, "to ensure the questions remain confidential" and to avoid leakage. The contractors were also not informed about the OpenAI funding on December 20th. I believe there were named authors of the paper who had no idea about the OpenAI funding.

I believe the impression for most people, and for most contractors, was "This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem."[3]

Neither Epoch AI nor OpenAI says publicly that OpenAI has access to the exercises, answers, or solutions. I have heard second-hand that OpenAI does have access to the exercises and answers, and that they use them for validation. I am not aware of any agreement between Epoch AI and OpenAI that prohibits using this dataset for training if they wanted to, and I have slight evidence against such an agreement existing.

In my view, Epoch AI should have disclosed the OpenAI funding, and contractors should have had transparent information about the potential for their work to be used for capabilities research when choosing whether to work on a benchmark.

  1. ^

    arXiv v5 (Dec 20th version): "We gratefully acknowledge OpenAI for their support in creating the benchmark."

  2. ^

    I do not know whether they disclosed it in response to neutral questions about who was funding this.

  3. ^

    This is from a comment by a non-Epoch AI person on Hacker News from two months ago. Another example: in a news article from November, Ars Technica wrote, "FrontierMath's difficult questions remain unpublished so that AI companies can't train against it."

Replies from: Tamay, Kei, ouguoc, havard-tveit-ihle, dan-hendrycks, leogao
comment by Tamay · 2025-01-19T02:45:44.014Z · LW(p) · GW(p)

Tamay from Epoch AI here.

We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access. We own this error and are committed to doing better in the future.

For future collaborations, we will strive to improve transparency wherever possible, ensuring contributors have clearer information about funding sources, data access, and usage purposes at the outset. While we did communicate that we received lab funding to some mathematicians, we didn't do this systematically and did not name the lab we worked with. This inconsistent communication was a mistake. We should have pushed harder for the ability to be transparent about this partnership from the start, particularly with the mathematicians creating the problems.

Getting permission to disclose OpenAI's involvement only around the o3 launch wasn't good enough. Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with our contributors a non-negotiable part of our agreement with OpenAI.

Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.

Relevant OpenAI employees’ public communications have described FrontierMath as a 'strongly held out' evaluation set. While this public positioning aligns with our understanding, I would also emphasize more broadly that labs benefit greatly from having truly uncontaminated test sets.

OpenAI has also been fully supportive of our decision to maintain a separate, unseen holdout set—an extra safeguard to prevent overfitting and ensure accurate progress measurement. From day one, FrontierMath was conceived and presented as an evaluation tool, and we believe these arrangements reflect that purpose. 

[Edit: Clarified OpenAI's data access - they do not have access to a separate holdout set that serves as an additional safeguard for independent verification.]

Replies from: ozziegooen
comment by ozziegooen · 2025-01-19T03:44:51.283Z · LW(p) · GW(p)

I found this extra information very useful, thanks for revealing what you did.

Of course, to me this makes OpenAI look quite poor. This seems like an incredibly obvious conflict of interest.

I'm surprised that the contract didn't allow Epoch to release this information until recently, but that it does allow Epoch to release the information now. This seems really sloppy for OpenAI. I guess they got a bit of extra publicity when o3 was released (even though the model wasn't even available), but now it winds up looking worse (at least to those paying attention). I'm curious whether this discrepancy was malicious or careless.

Hiding this information seems very similar to lying to the public. So, at the very least, from what I've seen, I don't feel like we have many reasons to trust their communications, especially "tweets from various employees."

> However, we have a verbal agreement that these materials will not be used in model training.

I imagine I can speak for a bunch of people here when I say I'm pretty skeptical. At the very least, it's easy for me to imagine situations where the data wasn't technically used directly in training, but was used by researchers when iterating on versions, to make sure the system was going in the right direction. This could lead to a very blurry line where they could do things that aren't [literal LLM training] but basically achieve a similar outcome.

comment by Kei · 2025-01-18T21:46:35.704Z · LW(p) · GW(p)

Epoch AI is also working on a "next-generation computer-use benchmark". I wonder who is funding that. It could be OpenAI, given recent rumors that they are planning to release a computer-use model early this year.

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2025-01-19T03:09:43.743Z · LW(p) · GW(p)

Having hopefully learned from our mistakes with FrontierMath, we intend to be more transparent with collaborators on this new benchmark. However, at this stage of development, the benchmark has not reached a point where any major public disclosures are necessary.

comment by ouguoc · 2025-01-18T20:36:27.336Z · LW(p) · GW(p)

Thanks for posting this!

I have to admit, the quote here doesn't seem to clearly support your title -- I think "support in creating the benchmark" could mean lots of different things, only some of which are funding. Is there something I'm missing here?

Regardless, I agree that FrontierMath should make clear what the extent was of their collaboration with OpenAI. Obviously the details here are material to the validity of their benchmark.

comment by Håvard Tveit Ihle (havard-tveit-ihle) · 2025-01-18T13:41:56.444Z · LW(p) · GW(p)

Why do you consider it unlikely that companies could (or would) fish the questions out of API logs?

Replies from: meeri
comment by meemi (meeri) · 2025-01-18T13:51:29.700Z · LW(p) · GW(p)

That was a quote from a commenter on Hacker News, not my view. I referenced the comment as representative of what I thought many people's impression was before Dec 20th. You may be right that most people didn't have the impression that it's unlikely, or that they didn't have a reason to think so. I don't really know.

Thanks, I'll put the quote in italics so it's clearer.

comment by Dan H (dan-hendrycks) · 2025-01-19T01:37:21.341Z · LW(p) · GW(p)

It's probably worth them mentioning, for completeness, that Nat Friedman funded an earlier version of the dataset too. (I was advising at the time and made the main recommendation that it needed to be research-level, because they were focusing on Olympiad level.)

I can also confirm they aren't giving AI companies other than OpenAI (such as xAI) access to the mathematicians' questions.

comment by leogao · 2025-01-19T03:56:47.438Z · LW(p) · GW(p)

this doesn't seem like a huge deal