Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

post by Arjun Panickssery (arjun-panickssery), agg (ag) · 2024-01-15T21:21:03.962Z

This is a summary of https://arxiv.org/abs/2401.05604.

When Google announced Gemini Pro, they displayed its ability to solve rebuses—wordplay puzzles which involve creatively adding and subtracting letters from words derived from text and images.

We introduce a new benchmark (GitHub) that evaluates the performance of multimodal large language models on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories like movies, composers, major cities, and food.

The REBUS dataset highlights several key challenges in multimodal language models:

We find that the proprietary models GPT-4V and Gemini Pro significantly outperform all other tested models, but even they achieve accuracies of only 24% and 13.2%, respectively. Models rarely understand all parts of a puzzle, and even when their answer is correct they almost always fail to justify it correctly.
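
To make those percentages concrete, here is a rough sketch that converts them into approximate puzzle counts. It assumes each accuracy is measured over all 333 puzzles, which the summary does not state explicitly; the function and variable names are ours, for illustration only.

```python
DATASET_SIZE = 333  # number of puzzles stated in the post

# Accuracies quoted in the post, as percentages.
reported_accuracy = {"GPT-4V": 24.0, "Gemini Pro": 13.2}


def approx_solved(accuracy_pct: float, total: int = DATASET_SIZE) -> int:
    """Approximate number of puzzles solved.

    Assumption: the accuracy is computed over the full dataset.
    """
    return round(accuracy_pct / 100 * total)


approx_counts = {
    model: approx_solved(pct) for model, pct in reported_accuracy.items()
}

for model, count in approx_counts.items():
    print(f"{model}: about {count} of {DATASET_SIZE} puzzles")
```

Under that assumption, the best model solves roughly 80 puzzles and leaves about three quarters of the dataset unsolved.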
