GPT-4 beats humans at hard rhyme-based riddles

For the past eight months, I’ve been running the rhyme-based daily word game Twofer Goofer. At this point, more than 10,000 users have solved the 240 distinct puzzles (one each day since we launched) more than 100,000 times.

Twofer Goofers are not meant to be a vocabulary or trivia contest, but rather a whimsical game that rewards creative, non-linear thinking. By including rhymes, deception, and word play, these puzzles tend to be more challenging than the day's Wordle. Here’s an example of a median-difficulty Twofer Goofer:

If stumped, users can access a few clues:

Number of syllables per word
Blurred image depicting the Twofer Goofer
First letters in each word
A more direct prompt for the Twofer Goofer, e.g. A fancy leather slipper worn by a hole-digging, big-toothed mammal.

The answer? Gopher Loafer. We then reward users for solving the puzzle with Midjourney-generated art each day.

A few months back, GPT-3.5 solved Twofer Goofers at a lower rate than our human players. With GPT-4’s release this week, we re-ran the test more formally.

How we evaluated GPT-4's ability to solve Twofer Goofers

Evaluation notes:

I’m using ChatGPT as the interface for GPT-3.5 and GPT-4.
I’m running separate chats with the GPT-3.5 and GPT-4 models.
ChatGPT doesn’t really “understand” either rhyming or syllable counts.

The dataset:

I tested against the first 100 Twofer Goofers, which were the daily puzzles from July to October 2022.
These puzzles have been collectively attempted by human users 40,000 times (and counting, as the archive is playable).

Setting up the test:

I explained Twofer Goofer’s rules to ChatGPT with five examples, e.g.
- Twofer Goofer: Globe Strobe
- Prompt: A recognizable sphere flashing intermittently in a nightclub.
- Alternate prompt: A 3D geography tool doubling as a blinking light.
Full rules explanation that I shared with ChatGPT.
ChatGPT started with the prompt and number of letters (same start as users).
Human users have unlimited guesses and can access clues at any time.
For ChatGPT, I allowed for a single guess per clue.
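
The one-guess-per-clue setup above can be sketched as a small loop. This is a hypothetical illustration, not the procedure actually used in the test: `guess_fn` is an assumed stand-in for whatever sends a message to the model and returns its guess, the feedback wording is invented, and which clues are passed in (and in what order) is left to the caller.

```python
from typing import Callable, Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so 'gopher  loafer ' matches 'Gopher Loafer'."""
    return " ".join(text.lower().split())


def clues_needed(
    answer: str,
    prompt: str,
    clues: Sequence[str],
    guess_fn: Callable[[str], str],  # hypothetical: sends a message, returns one guess
) -> Optional[int]:
    """Return how many clues were revealed before a correct guess, or None if unsolved.

    The model gets one guess on the opening prompt, then one guess per clue.
    """
    message = prompt
    for used in range(len(clues) + 1):
        guess = guess_fn(message)
        if normalize(guess) == normalize(answer):
            return used
        if used < len(clues):
            # Invented feedback wording; reveal the next clue and allow one more guess.
            message = f"Incorrect. Next clue: {clues[used]}"
    return None


if __name__ == "__main__":
    # A scripted "model" that misses once, then gets it right after the first clue.
    scripted = iter(["Mole Sole", "Gopher Loafer"])
    result = clues_needed(
        answer="Gopher Loafer",
        prompt="A fancy leather slipper worn by a hole-digging, big-toothed mammal.",
        clues=["Syllables: 2 and 2", "First letters: G and L"],
        guess_fn=lambda message: next(scripted),
    )
    print(result)  # 1
```

Returning the clue count rather than a plain solved/unsolved flag keeps both numbers the results rely on: whether the puzzle was solved at all, and how many clues it took.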

Here’s an example of what I’d send after ChatGPT had an incorrect guess:

The results of the test

TLDR:

GPT-3.5 is worse than human players at solving Twofer Goofers.
GPT-4 is better than human players at solving Twofer Goofers.

To compare users against ChatGPT, I had to normalize the scoring.

For a given puzzle, the human solve rate reflects (# of users that solved the puzzle) / (# of players that attempted the puzzle). Averaging across the 100 puzzles gives an overall solve rate of 82% for human users. Notably, the puzzles range widely in difficulty, from a 97% solve rate:

to this one with a 45% solve rate (whoops! My Midwestern background is showing):

For ChatGPT, the solve rate on a single puzzle is either 100% or 0%. ChatGPT's overall solve rate is simply the number of Twofers solved (out of 100).
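
The normalization described above can be written out as a rough sketch. The function names are made up for illustration, and the per-puzzle counts in the usage example are placeholders rather than real Twofer Goofer data. Only the "missed four out of 100" figure for GPT-4 comes from the results.

```python
from statistics import mean
from typing import Sequence, Tuple


def puzzle_solve_rate(solved: int, attempted: int) -> float:
    """Human solve rate for one puzzle: solvers divided by players who attempted it."""
    if attempted <= 0:
        raise ValueError("a puzzle needs at least one attempt")
    return solved / attempted


def overall_human_rate(per_puzzle: Sequence[Tuple[int, int]]) -> float:
    """Average the per-puzzle rates, so every puzzle counts equally."""
    return mean(puzzle_solve_rate(s, a) for s, a in per_puzzle)


def overall_model_rate(solved_flags: Sequence[bool]) -> float:
    """For a model, each puzzle is either solved (100%) or not (0%)."""
    return sum(solved_flags) / len(solved_flags)


if __name__ == "__main__":
    # Placeholder (solved, attempted) counts for two imaginary puzzles.
    print(overall_human_rate([(97, 100), (45, 100)]))

    # A model that misses four of 100 puzzles.
    print(overall_model_rate([True] * 96 + [False] * 4))  # 0.96
```

Averaging per-puzzle rates, rather than pooling every attempt into one big ratio, keeps a heavily played puzzle from dominating the overall number.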

While GPT-3.5 underperformed users by 10 percentage points, GPT-4 solved nearly every single puzzle (only missing four). GPT-4 also required significantly fewer clues than either users or GPT-3.5.

Particularly interesting were the puzzles GPT-4 solved immediately while GPT-3.5 remained stumped even after using all clues:

Although omitted from the results table, comparing speed is another interesting lens for assessment. The median user solves a given puzzle in ~50 seconds, while GPT-4 always solved in <10 seconds. It harkens back to the unfair buzzer advantage IBM Watson had playing Jeopardy! back in 2011.

Here’s the full set of ChatGPT responses.

Can GPT-4 help us run Twofer Goofer?

Each night, I vet the next day’s puzzle by sharing it with the company’s VP (Voice of the People). She’s also my wife. She’ll veto or edit Twofers that she thinks are too hard, “too cute,” or just unfun.

I wanted to know if ChatGPT could become the new Voice of the People. Can I pre-test how hard a Twofer will be by asking ChatGPT to solve it first?

Because GPT-4 solves nearly every Twofer, its solve rate alone can't separate easy puzzles from hard ones. Instead, I tested how the number of clues ChatGPT uses correlates with both the human solve rate and the number of clues used by humans.

It turns out there is a positive correlation between ChatGPT-clues and human-clues: when ChatGPT uses more clues, users tend to as well. However, the correlation is only 17%. A perfect correlation (every extra clue used by ChatGPT matched by a proportional increase in clues used by humans) would be 100%.
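
For reference, here is a minimal sketch of how a correlation like this could be computed. It assumes the Pearson correlation coefficient, which the write-up does not actually specify, and the clue counts in the usage example are invented placeholders, not the real per-puzzle data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series (range -1 to 1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of the same length, with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return covariance / sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Placeholder data: clues used by the model and average clues used by humans,
    # one entry per puzzle.
    model_clues = [0, 0, 1, 3, 0, 2]
    human_clues = [0.4, 1.1, 0.6, 1.0, 0.9, 0.5]
    print(f"{pearson(model_clues, human_clues):.0%}")
```

One caveat with this kind of measure: if the model solves most puzzles with zero clues, its series has very little variation, which limits how strong a correlation can show up.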

Notably, GPT-4's solving abilities are more correlated with humans than GPT-3.5's, but it's still a low correlation. For now, our Voice of the People will keep her job.

What about writing Twofer Goofers?

For 240 straight nights I’ve written a Twofer Goofer (yes, I should write in bulk, but procrastinators gonna procrastinate). "Writing" means:

Coming up with a new pair of rhyming words
Determining the visual concept/image the pair describes
Writing the first prompt
Writing a second, easier prompt (without repeating words)
Making the day’s art via Midjourney

After quizzing GPT-4 and GPT-3.5 on Twofer Goofers, I hoped they had ample training data to “understand” the game. So I asked each to generate a list of ten Twofer Goofers.

Yet again, GPT-4 is a massive step up in functionality.

Six of the ten written by GPT-4 are acceptable Twofer Goofers with perfect rhyme. Only two out of ten from GPT-3.5 were acceptable.

Compare an example of GPT-4’s output:

Answer: Thunder Blunder
Prompt: A loud atmospheric phenomenon causing a foolish mistake.
Alternate Prompt: A noisy weather event leading to an embarrassing error.

to an example of GPT-3.5’s output:

Answer: Doggy Doo(r)
Prompt: A metallic door for a furry, four-legged friend.
Alternate Prompt: A puppy's personal entryway.

Here's the full output of puzzles written by the robots.

Will I start using GPT-4 to write the Twofer Goofers? Not quite yet. There’s still editorializing required to fit the prompts to an image, to have both prompts build on that image's concept, and to inject a consistent sense of personality into the wording.

Final musings on creativity and curation

Curation: The internet brought the marginal cost of distribution to ~nothing. Multimodal, generative AI brings the marginal cost of "creation" to ~nothing. This compounds the value of curation, and I believe presages a multi-year (decade?) tailwind on the importance of curation, gatekeeping, and editorializing. What are the Wirecutters for a generative AI-filled world? Does personalized creation negate this curation tailwind?

Creativity: These tools (ChatGPT, Midjourney, etc) are also forcing us to reckon with the definition of creativity. Two years ago my grandma casually told me she was so proud to have two creative grandchildren. She wasn’t talking about me. She was talking about cousins who make art or music. What is creativity? We consider photography creative now; were early photographers creatives or technologists? Is coming up with a pair of rhyming words creative? Is cleverly prompting Midjourney creative? What about conceptualizing the image for that pair of rhyming words, like when Carriage Marriage featured married frogs instead of humans? I know not the answers. To me, there is creativity in running Twofer Goofer. Creativity that includes mish-mashing together separate ideas, imbuing originality, and creating surprise.

Thanks for reading: give the game a whirl! And send feedback to [email protected]