Posts

Fable 5 Joins the 3D Maze Runner Race

Recent posts

Your Agent Framework Doesn’t Matter - Your Data Boundary Does

I recently worked through a practical problem that many enterprise teams will run into when using LLMs: how do you use powerful frontier models without exposing proprietary data unnecessarily? I cannot describe the exact use case, because it involves internal and proprietary information. So the example below is invented. But the architectural issue is the same. Imagine a financial-services workflow where a system receives an abbreviated security description such as: “UBS Grp 4.20% Call Sr Nts 30” and needs to resolve it into something more explicit: “UBS Group AG, 4.20% Callable Senior Notes, maturity 2030.” The obvious way to solve this is to give the task to a frontier model. It will probably do a good job. But in a real enterprise setting, the question is not only whether the model can solve the task. The more important question is: what else does the model see while solving it? That is where the architecture matters. There are at least two common ways to build this kind...

Opus 4.8 vs. GPT-5.5: The 3D Maze Runner Rematch

A Head-to-Head Comparison of AI Coding - Round Two

Introduction

A while back I pitted Opus 4.5 against Kimi 2.5 in a 3D maze runner build-off. This is the rematch. I had hoped to run this round with Fable, but I was a day too late, so Opus 4.8 took the seat instead. The harnesses were native to each model: Codex for GPT-5.5 and Claude Code for Opus 4.8. Same challenge as before: build a complete 3D maze runner from scratch.

Opus 4.8 (Claude Code)

00:30 - Claude Code opens by asking me questions about the framework and algorithm to use.
05:20 - Thinking done. It leaves plan mode and starts writing code.
08:51 - Build succeeds; installing Chromium.
09:53 - Running through the maze, it detects graphical issues.
11:04 - Graphical issues appear fixed; it moves on to check for graffiti on the walls.
12:31 - It decides the graffiti should “desaturate in dim corridors,” so it makes them glow more.
14:01 - Running final tests and cleanup; one last production build.
14:42 ...

Opus 4.5 vs. Kimi 2.5: The 3D Maze Runner Showdown

A Head-to-Head Comparison of AI Coding

Introduction

Over the weekend, I put two leading AI models to the test: Opus 4.5 (using Claude Code) and Kimi 2.5 (using Opencode). The challenge? Build a complete 3D maze runner game from scratch, including maze generation, A* pathfinding, Three.js visualization, and automated testing. Both systems received identical prompts and were timed from start to finish. What followed was a fascinating race that revealed important differences in how these AI systems approach complex coding tasks, handle context windows, and recover from errors.

The Challenge

Both AI assistants were given the same comprehensive prompt requiring them to build a web-based 3D maze demo with maze generation, 3D visualization, minimap, A* pathfinding, and Playwright tests.

The Original Prompt (excerpt)

# 3D Maze Runner Demo - Coding Assistant Prompt

## Overview

Create a complete web-based 3D maze demo using Three.js as the visualization layer. The application shou...

MCP + Context: engineering for the context – hard lessons learned

Intro

I have built my own orchestration framework because most of what I’ve seen was too complex or tried to lock you into creating workflows a certain way. I wanted something very simple and yet maximally flexible. I’m not going into details here on the framework (that’s another blog post), but I will in some cases explain why I could do what I did thanks to the flexibility of the framework, which is a dynamic DAG, can do call-backs, and uses functions and MCP servers. I will also not explain in detail what I’m doing with my current workflow, other than to say I was looking for a way to bypass large language models and instead run it on my own system at home. I succeeded with that, but that’s another blog post. Instead, what I will try to explain in this post is the most important thing after prompt engineering: context engineering, and why it’s so crucial to manage that aspect (especially when you run this at home).

Stage setting

A couple of weeks ago, Anthropic posted this: ...

Antigravity + Gemini 3 are serious contenders

The News

Google’s rebirth as the AI behemoth it was always meant to be has finally come to fruition. With the release of Gemini 3, Google has made it clear that it can compete with, and even surpass, OpenAI and Anthropic. And now it seems everyone and their mother is shipping AI-enhanced IDEs which, in almost all cases, are forks of VS Code. Google’s offering is no exception: Antigravity, a rather interesting choice of name.

The Need

I’ve been working on a Node.js-based orchestration framework for AI agents (yes, yet another one, even though there are already countless options out there). I looked, but none really did what I wanted, so I went ahead and built my own, thanks in part to the helping hands of Claude Code (Anthropic). Bottom line: it works fine. There’s still a lot of manual tweaking required, but I was able to achieve things I couldn’t do with plain ChatGPT or Claude alone. However, it lacks a UI. I didn’t even consider building or using one, because I figured I just neede...

How I Ended Up Creating an AI Playground to Illustrate and Educate

TL;DR

AI Playground User Guide