- GPT-4 solves Twofer Goofers at a 96% rate
- Humans solve at an 82% rate
- Bard solves at, essentially, a 0% rate
Wait, what's a Twofer Goofer?
Twofer Goofers are daily pairs of rhyming words described by a roundabout prompt. Players use the prompt and a series of clues to solve the puzzle and are rewarded with a piece of custom art. At this point, more than 12,000 human users have solved the 240 distinct puzzles more than 100,000 times in total.
Here's an example of a solved Twofer Goofer:
Back to the chatbots
In fancy terms: we have a proprietary dataset of human puzzle-solving data against which we can test these AI tools.
In normal terms: it's fun to see if the robots can figure out the creative non-linear thinking required to solve rhyme-based riddles. It's particularly fun because the robots don't actually understand what rhyming is.
The results from last week's test (full blog post here):
- Human users solve about 82% of Twofer Goofers, using an average of 1.6 clues
- GPT-4 is much better than humans, solving 96% of the puzzles and needing only 0.9 clues
- GPT-3.5 is impressive, but worse than humans at a 72% solve rate with 2.0 clues per puzzle
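For the curious, here's roughly how numbers like these can be computed. This is a minimal sketch, not our actual pipeline: the `Attempt` record, the `summarize` helper, and the toy data are all made up for illustration, and it assumes the clue average is taken over solved puzzles only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Attempt:
    """One solver's run at one puzzle (hypothetical record shape)."""
    puzzle_id: int
    clues_used: Optional[int]  # None means the puzzle was never solved


def summarize(attempts: List[Attempt]) -> Tuple[float, Optional[float]]:
    """Return (solve rate, average clues used on solved puzzles)."""
    if not attempts:
        raise ValueError("need at least one attempt")
    solved = [a for a in attempts if a.clues_used is not None]
    solve_rate = len(solved) / len(attempts)
    # Assumption: unsolved puzzles are left out of the clue average.
    avg_clues = sum(a.clues_used for a in solved) / len(solved) if solved else None
    return solve_rate, avg_clues


# Toy data, not real results: three solves out of four attempts.
toy_attempts = [Attempt(1, 0), Attempt(2, 2), Attempt(3, None), Attempt(4, 1)]
toy_rate, toy_avg_clues = summarize(toy_attempts)
print(f"solve rate: {toy_rate:.0%}, avg clues: {toy_avg_clues:.1f}")
```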
I was thrilled to toss Bard into the fray after gaining access to the open beta today. However, the results were shockingly disappointing.
Bard was not able to solve a single Twofer Goofer when given the prompt. It was close in a couple of instances, but ultimately unsuccessful.
Here's Bard's attempt at the first 20 Twofer Goofers:
Even without seeing the prompts, you can tell these are incorrect guesses because they aren't pairs of rhyming words. Ultimately, Bard failed its first attempt on all 100 puzzles.
On a handful of Twofers, I ran through the full gauntlet of clues, but Bard still failed. Here's one of the easiest puzzles (as evidenced by a user solve rate of 97%):
GPT-4 and GPT-3.5 solved this puzzle immediately. Here's Bard's attempt(s):
Note: Our GPT-4 blog post has much more detail about exactly how we ran this test, including the documentation of the exact prompting used.
But these are hard puzzles to solve, even for humans!
Indeed, but the concept of rhyming isn't too difficult for humans. (Though Twofer Goofer HQ's adherence to strict "perfect" rhyme can be tricky for those slant rhyme-inclined.) Regardless, Bard's understanding of rhyming is meaningfully behind ChatGPT's, as evidenced by this hastily conceived test.
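To make "perfect" versus slant rhyme concrete, here's a small sketch. It uses the common textbook definition (the two words match from the last stressed vowel onward, and the sound right before that vowel differs), which may not be exactly the rule Twofer Goofer HQ applies. The tiny pronunciation table is hand-written in ARPAbet-style symbols purely for illustration, and the function names are made up.

```python
from typing import Dict, List

# Hand-written toy pronunciations (an assumption, not a real dictionary).
# A trailing "1" marks the primary-stressed vowel, "0" an unstressed one.
TOY_PRONUNCIATIONS: Dict[str, List[str]] = {
    "twofer": ["T", "UW1", "F", "ER0"],
    "goofer": ["G", "UW1", "F", "ER0"],
    "cat": ["K", "AE1", "T"],
    "hat": ["HH", "AE1", "T"],
    "time": ["T", "AY1", "M"],
    "line": ["L", "AY1", "N"],
}


def stressed_vowel_index(phones: List[str]) -> int:
    """Index of the last primary-stressed vowel in a pronunciation."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return i
    raise ValueError("no primary-stressed vowel found")


def is_perfect_rhyme(word_a: str, word_b: str) -> bool:
    """True if both words share every sound from the stressed vowel on,
    and the sound just before that vowel is different."""
    a = TOY_PRONUNCIATIONS[word_a.lower()]
    b = TOY_PRONUNCIATIONS[word_b.lower()]
    ia, ib = stressed_vowel_index(a), stressed_vowel_index(b)
    same_tail = a[ia:] == b[ib:]
    onset_a = a[ia - 1] if ia > 0 else None
    onset_b = b[ib - 1] if ib > 0 else None
    return same_tail and onset_a != onset_b


print(is_perfect_rhyme("twofer", "goofer"))  # True: same tail, different onset
print(is_perfect_rhyme("time", "line"))      # False: slant rhyme, endings differ
```

A word never rhymes with itself under this definition, which is why the onset check is there.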
What does this all mean?
Not much! We've seen many "empirical" tests quoted to demonstrate the quality of a given AI model (LSAT scores, etc.). But humans thrive at creativity, non-linear thinking, and conceptual understanding (like what rhymes are!), so tests on data like Twofer Goofer solve rates are a valuable way to assess the true progress of these tools.
More to come! Send any feedback or complaints or musings to [email protected]