A Continuation of the Opus 4.8 vs. GPT-5.5 Rematch
In the last round I compared Opus 4.8 in Claude Code with GPT-5.5 in Codex on the same 3D maze runner challenge I had used before. Both did well. GPT-5.5 was faster, Opus 4.8 handled the player mode better, and both were already far ahead of the older Opus 4.5 vs. Kimi 2.5 run.
But the model I originally wanted to test in that rematch was Fable 5. I missed it by a day. So this is the follow-up: the same challenge, this time with Fable 5.
Fable 5
This run was slower than the previous two, but also more methodical. Fable 5 spent more time up front planning and testing, and that became the defining difference.
Timeline:
- 01:50 - Implementation plan ready.
- 05:35 - First code and config files written.
- 06:15 - npm dependencies installing.
- 08:23 - Test cases being created.
- 09:01 - Unit testing starts.
- 09:12 - Unit tests complete.
- 09:14 - Browser tests start.
- 10:06 - Playwright browser window appears.
- 11:11 - Code update, then Playwright again.
- 12:23 - Another code update and Playwright run.
- 18:45 - Finished.
The final report was unusually comprehensive. The important detail is that it did not simply fake the run by walking the optimal path. It kept an A* solver for reference, but the actual runner explored the maze physically and counted both forward and backward movement. In one witnessed run it scored 746 points against a 628-move optimal path. Other witnessed runs finished with 458 and 694 points.
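To make the difference between "walking the optimal path" and "exploring physically" concrete, here is a minimal sketch. It is not Fable 5's code. The grid encoding, the function names, and the depth-first exploration strategy are all assumptions for illustration, and the reference solver uses plain breadth-first search in place of A* (on an unweighted grid both give the same shortest length).

```typescript
// Hypothetical grid encoding: '#' is a wall, any other character is open floor.
type Cell = { r: number; c: number };

// Neighbor order: up, right, down, left.
const DIRS: Cell[] = [
  { r: -1, c: 0 },
  { r: 0, c: 1 },
  { r: 1, c: 0 },
  { r: 0, c: -1 },
];

function isOpen(grid: string[], r: number, c: number): boolean {
  return r >= 0 && r < grid.length && c >= 0 && c < grid[r].length && grid[r][c] !== "#";
}

// Reference solver: breadth-first search (swapped in for A*).
// Returns the shortest number of moves, or -1 if the goal is unreachable.
function shortestPathLength(grid: string[], start: Cell, goal: Cell): number {
  const dist = new Map<string, number>([[`${start.r},${start.c}`, 0]]);
  const queue: Cell[] = [start];
  for (let i = 0; i < queue.length; i++) {
    const cur = queue[i];
    const d = dist.get(`${cur.r},${cur.c}`)!;
    if (cur.r === goal.r && cur.c === goal.c) return d;
    for (const dir of DIRS) {
      const nr = cur.r + dir.r;
      const nc = cur.c + dir.c;
      const key = `${nr},${nc}`;
      if (isOpen(grid, nr, nc) && !dist.has(key)) {
        dist.set(key, d + 1);
        queue.push({ r: nr, c: nc });
      }
    }
  }
  return -1;
}

// Exploring runner: a depth-first walk that pays one point per physical step,
// including every step taken while backing out of a dead end.
function exploreAndScore(grid: string[], start: Cell, goal: Cell): number {
  const visited = new Set<string>();
  let points = 0;
  const walk = (cur: Cell): boolean => {
    visited.add(`${cur.r},${cur.c}`);
    if (cur.r === goal.r && cur.c === goal.c) return true;
    for (const dir of DIRS) {
      const next: Cell = { r: cur.r + dir.r, c: cur.c + dir.c };
      if (!isOpen(grid, next.r, next.c) || visited.has(`${next.r},${next.c}`)) continue;
      points++; // forward step into an unexplored cell
      if (walk(next)) return true;
      points++; // backward step after the branch turned out to be a dead end
    }
    return false;
  };
  return walk(start) ? points : -1;
}

// Tiny demo maze: the goal is two moves below the start, but the runner
// tries the corridor to the right first and has to walk all the way back.
const demoMaze = ["S..", ".#.", "G#."];
const demoStart: Cell = { r: 0, c: 0 };
const demoGoal: Cell = { r: 2, c: 0 };
console.log(shortestPathLength(demoMaze, demoStart, demoGoal)); // 2
console.log(exploreAndScore(demoMaze, demoStart, demoGoal)); // 10
```

The gap between the two numbers is the cost of honest exploration, which is why a score such as 746 against a 628-move optimum is more convincing than a score that matches the optimum exactly. A recursive walk like this would need an explicit stack for very large mazes.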
Verification was also stronger than in the earlier runs:
- 13/13 unit tests passed.
- 7/7 Playwright end-to-end tests passed.
- Three complete maze solves were witnessed live.
- Console errors were checked.
- A favicon 404 was fixed (it is now a Pac-Man SVG).
That last point may sound small, but it matters. The earlier comparisons were mostly about whether the model could build the thing. Fable 5 behaved more like it was trying to ship the thing.
Visual Results
The welcome screen is simple and polished: dark background, yellow title, and a single "Create Maze" button.
The generated maze view is crisp and practical. After generation, it shows both "Start Maze Runner" and "Play Yourself" buttons.
The 3D mode has the expected first-person corridor view, a minimap in the top-left corner, a digital point counter in the top-right corner, and visible graffiti on the wall. The graffiti is not just technically present; it is actually readable.
The victory modal shows both total points and the optimal path length. In the screenshot below, the run completed with 534 points against a 500-move optimal path.
The Player Mode Follow-Up
As with the previous test, the automated run finished too quickly to capture everything comfortably, so I asked for a manual player mode. Fable 5 added it and also expanded the test suite:
- 18 unit tests passed.
- 10 Playwright tests passed.
- New tests covered WASD movement, wall blocking, and a full keyboard playthrough.
- The full keyboard playthrough computed the shortest path and drove the player to the goal with roughly 900 real keypresses.
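A keyboard playthrough like the one in the last bullet needs a step that turns a computed path into key events. The sketch below shows one way that translation could look. It assumes a control scheme where "a" and "d" turn the player 90 degrees and "w" steps forward; the post does not say how Fable 5 mapped the keys, and `pathToKeys`, `headingOf`, and the heading numbering are hypothetical names.

```typescript
type Cell = { r: number; c: number };
type Key = "w" | "a" | "d";

// Headings are numbered clockwise: 0 = north, 1 = east, 2 = south, 3 = west.
// With that numbering, a right turn adds 1 and a left turn subtracts 1.
function headingOf(dr: number, dc: number): number {
  if (dr === -1 && dc === 0) return 0;
  if (dr === 0 && dc === 1) return 1;
  if (dr === 1 && dc === 0) return 2;
  if (dr === 0 && dc === -1) return 3;
  throw new Error("path cells must be orthogonally adjacent");
}

// Assumed controls: "a" turns left, "d" turns right, "w" moves one cell forward.
function pathToKeys(path: Cell[], initialHeading: number): Key[] {
  const keys: Key[] = [];
  let heading = initialHeading;
  for (let i = 1; i < path.length; i++) {
    const target = headingOf(path[i].r - path[i - 1].r, path[i].c - path[i - 1].c);
    const quarterTurns = (target - heading + 4) % 4; // clockwise quarter-turns needed
    if (quarterTurns === 1) keys.push("d");
    else if (quarterTurns === 2) keys.push("d", "d"); // about-face
    else if (quarterTurns === 3) keys.push("a");
    keys.push("w");
    heading = target;
  }
  return keys;
}

// Start facing north, walk two cells south, then one cell east.
const demoPath: Cell[] = [
  { r: 0, c: 0 },
  { r: 1, c: 0 },
  { r: 2, c: 0 },
  { r: 2, c: 1 },
];
console.log(pathToKeys(demoPath, 0).join("")); // "ddwwaw"
```

In a Playwright test, each key in the result could then be sent with `page.keyboard.press(key)`. Under this scheme turns add keypresses on top of the moves, so the number of keypresses exceeds the number of path steps whenever the path bends.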
Fable 5 was not the fastest contestant in this round, but it produced the cleanest engineering process.
Overall
Compared with Opus 4.8 and GPT-5.5, Fable 5 took longer but tested and verified far more:
| Metric | Opus 4.8 | GPT-5.5 | Fable 5 |
|---|---|---|---|
| First complete version | 14:42 | 11:52 | 18:45 |
| With player mode | After follow-up | After follow-up | 36:59 total |
| Asked clarifying questions | Yes | No | No, but produced a detailed plan |
| Wall graffiti | Yes | Needed reminder | Yes |
| Graffiti quality | Nicer | More visible | Clear and readable |
| Player mode | First-person | Directional movement | First-person WASD |
| Unit tests | Not the highlight | Not the highlight | Extensive |
| Playwright verification | Some | Some | Extensive |
The headline is not that Fable 5 won on speed. It did not.
The headline is that the old benchmark is getting exhausted. The first versions of this test were good at exposing basic failures: incomplete builds, missing requirements, poor visuals, or agents that could not really verify their own work. That is becoming less interesting. Fable 5 not only built the app, it planned the architecture, wrote unit tests, ran browser tests, fixed issues it found, added manual playability, and expanded the test suite for the new feature.
At this point the models have outgrown the prompt.
The models have become much more capable, and the 3D maze runner no longer separates them the way it used to. The next test needs to be harder: less toy-like, more stateful, more ambiguous, and probably closer to a real product workflow.
That, in itself, is the result.



