Intro
I have built my own orchestration framework because most of what I’ve seen was too complex or tried to lock you into creating workflows a certain way. I wanted something very simple and yet maximally flexible. I’m not going into details on the framework here (that’s another blog post), but I will in some cases explain why the framework’s flexibility let me do what I did: it is a dynamic DAG, it supports callbacks, and it uses functions and MCP servers. I will also not explain in detail what I’m doing with my current workflow, other than to say I was looking for a way to bypass the big hosted language models and instead run everything on my own system at home. I succeeded with that, but that too is another blog post. Instead, what I will try to explain in this post is the most important thing after prompt engineering: context engineering, and why it’s so crucial to manage that aspect (especially when you run this at home).
Stage setting
A couple of weeks ago, Anthropic posted this: https://www.anthropic.com/engineering/code-execution-with-mcp (which itself is actually based on an idea by Cloudflare), in which they make the solid observation that if you overload your context window with MCP tool definitions, you have fewer tokens left for the actual task at hand, and intermediate tool results further decrease the number of tokens left in the context window. I read it and thought “duh!”, and it made me think: either I’m much smarter than most (unlikely), or people just don’t understand MCP servers and the context window (more likely). I guess the second explanation is the relevant one. But here is the thing: it’s also a form of ideology, because when Anthropic talks about the use of MCP servers, they primarily look at it from the point of view of a user chatting with their LLMs who has loaded up on MCP servers so that the LLM can “do anything” (which is of course not true). Then there are the pragmatists like myself, who see the evolution of LLMs and tool calling as an excellent technology for taking the automation of pre-defined workflows to the next level. It’s two worlds, but I think mine is the more relevant one in a setting where we already have pre-defined processes in place. The article is only correct if you don’t have a pre-defined process.
That was a rather long introduction to lay out the concept of pre-defined workflows and how to go about agentic AI. Here are my learnings:
1. MCP servers
MCP servers are great, especially well-engineered MCP servers, because they already understand that in order to get the most out of the context window and the LLM, they have to focus on what to return and limit the information to what is relevant to the context. In other words: if your MCP server dumps 10,000 tokens into your request, then you have a problem. I’m using the excellent Playwright MCP server by Microsoft which, for example, doesn’t return HTML; it returns very light YAML. The other important aspect is the one the Anthropic post mentioned earlier highlights: you should only supply the tools that you know your agent will require. For example, if all you want to do is a database query, you only provide the “select from table” tool and nothing else. I can specify that in my orchestration workflow. But sometimes the result is still too large.
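My framework isn’t public, so here is a minimal sketch of the idea in plain Python instead, assuming an OpenAI-compatible endpoint (LM Studio exposes one) and an illustrative “select from table” tool; the model name, port and tool schema are all assumptions you would adapt.

```python
# Minimal sketch: hand the model only the one tool this agent actually needs.
# Endpoint, model name and tool schema are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# The MCP server behind this might expose dozens of tools; we forward just one.
select_tool = {
    "type": "function",
    "function": {
        "name": "select_from_table",          # hypothetical tool name
        "description": "Run a read-only SELECT against one table.",
        "parameters": {
            "type": "object",
            "properties": {
                "table": {"type": "string"},
                "where": {"type": "string"},
            },
            "required": ["table"],
        },
    },
}

response = client.chat.completions.create(
    model="qwen3-next-80b",                   # whatever your local server has loaded
    messages=[{"role": "user", "content": "How many open orders are there?"}],
    tools=[select_tool],                      # only this definition enters the context
)
print(response.choices[0].message)
```

The point is simply that the tools list is per-request: nothing forces you to forward everything an MCP server advertises.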
2. Tool-call callback
I have looked for ways to reduce the amount of information coming back from a tool call, because in certain situations it’s still too much information, or not relevant information. My agent structure allows for callbacks when doing tool-calling. The idea here is that before sending the tool result back to the LLM in the next request, you reduce the tool result’s length; a minimal sketch follows the two examples below.
Two examples:
a) Let’s say you have an agentic flow where you visit various websites and sometimes you visit the same site again because it’s interlinked. In my callback function, I keep an array of websites visited and when I see that a website has already been visited, I return this as the tool-call content: “ALREADY VISITED.” That’s 3–4 tokens instead of sending redundant context.
b) Staying with the web example, assuming your MCP server returns HTML: do you really need all that HTML in your context window? You don’t, because all you really want are either the URLs for follow-up crawling or the plain text. Which means: just send back the URLs, or just send the plain text back. But in some cases, even that is too much context.
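Here is a minimal sketch of such a callback, covering both examples, assuming an orchestrator that lets you post-process a tool result before it is appended to the conversation. The hook name, its signature and the “browser_navigate” tool name are hypothetical; the HTML handling is illustrative.

```python
# Minimal sketch of a tool-result callback. The on_tool_result hook is a
# hypothetical extension point; adapt it to whatever your orchestrator offers.
from urllib.parse import urljoin
from html.parser import HTMLParser

visited: set[str] = set()

class LinkAndTextExtractor(HTMLParser):
    """Pull only links and plain text out of an HTML tool result."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []
        self.text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def on_tool_result(tool_name: str, arguments: dict, result: str) -> str:
    """Shrink the tool result before it goes back to the LLM."""
    if tool_name != "browser_navigate":        # hypothetical tool name
        return result
    url = arguments.get("url", "")
    if url in visited:
        return "ALREADY VISITED"               # a few tokens instead of a full page
    visited.add(url)
    parser = LinkAndTextExtractor(url)
    parser.feed(result)                        # assuming the raw result is HTML
    # Keep only what the flow actually needs: URLs for follow-up, plain text.
    return "\n".join(parser.links) + "\n\n" + " ".join(parser.text)
```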
3. Compression (or summarisation)
Let’s stay with the web example. Assuming the tool call returned a 10,000-character HTML document, you strip out the HTML and you’re left with 7,000 characters. Now, if it’s just the semantics of the document that matter, you can send that through another LLM that is very efficient at precisely capturing the meaning of those 7,000 characters before sending it back into your agentic flow. I’ve had compression results of over 90% doing this — and in the end, the process came to the same result as without compression. It works. But sometimes, you really need to find a way to match semantics to other documents.
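As a sketch of what that compression step can look like: route the stripped-down text through a second, smaller model before it re-enters the flow. The endpoint and the summariser model name below are assumptions; any OpenAI-compatible server (LM Studio included) works the same way.

```python
# Minimal sketch: compress a stripped-down tool result with a small second model
# before it re-enters the agentic flow. Endpoint and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def compress(text: str, target_chars: int = 700) -> str:
    """Ask a small model to capture the meaning of `text` in far fewer tokens."""
    response = client.chat.completions.create(
        model="qwen3-4b",  # hypothetical small summariser model
        messages=[
            {"role": "system",
             "content": "Summarise the text, preserving all facts, names and "
                        f"numbers, in at most {target_chars} characters."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

# plain_text is the ~7,000-character document with the HTML already stripped out:
# compressed = compress(plain_text)
```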
4. Vectorisation (or vector search)
Let’s assume that the tool-call result needs to be matched as closely as possible to some other document in your agentic flow. Well — you could vectorise the result. Basically, you could do RAG (Retrieval Augmented Generation), search for the relevant document, and return that content.
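A minimal sketch of that matching step, assuming the sentence-transformers package and an illustrative in-memory document list (a real setup would use a proper vector store):

```python
# Minimal sketch: match a tool result against local documents by vector
# similarity instead of pushing everything into the context window.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Refund policy: items can be returned within 30 days.",
    "Shipping times: EU orders arrive in 3-5 business days.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

def best_match(tool_result: str) -> str:
    """Return the stored document closest in meaning to the tool result."""
    query = model.encode([tool_result], normalize_embeddings=True)[0]
    scores = doc_vectors @ query          # cosine similarity (vectors are normalized)
    return documents[int(np.argmax(scores))]

# Only the single best-matching document goes back into the agentic flow.
```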
5. Call MCP servers directly
Especially for the last example (but it works anywhere): MCP servers can be called directly — precisely because well-engineered ones are context-sensitive.
I have a very good (hard-learned) example of this: I’ve compared website crawling using a traditional agentic workflow (with all my tricks) versus crawling by calling the MCP server directly. I can do that in my orchestration framework by simply defining an agent that only knows one function to execute. In other words: I’ve tested this with my function agent and my agentic agent. The URL is passed the exact same way between the two, as is the result, so they are interchangeable. And indeed, it was much more efficient to just do a loop and crawl the website instead of doing this via an agentic flow. This should not come as a surprise, and it may not apply to everyone: if you just crawl a website, you don’t need the agentic flow, but if you want to do very specific things on a website (like filling in a form), a plain loop wouldn’t work.
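For illustration, here is a minimal sketch of that plain loop, assuming the official MCP Python SDK (the “mcp” package) and the Playwright MCP server; the “browser_navigate” tool name is taken from that server but may differ in your version.

```python
# Minimal sketch: call the Playwright MCP server directly in a plain loop,
# skipping the agentic flow entirely. Tool name and server invocation are
# assumptions; adapt to your setup.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(command="npx", args=["@playwright/mcp@latest"])

async def crawl(urls: list[str]) -> list[str]:
    results = []
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            for url in urls:
                result = await session.call_tool("browser_navigate", {"url": url})
                # result.content is a list of content blocks; keep the text ones.
                results.append("\n".join(
                    block.text for block in result.content if hasattr(block, "text")
                ))
    return results

# pages = asyncio.run(crawl(["https://example.com"]))
```

No model in the loop, no tool definitions in any context window: the LLM only sees whatever you choose to feed it afterwards.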
Lesson learned
What’s the lesson here? After spending a good week perfecting my workflow so I can run it on my aging M1 Ultra with 64GB (I’m using LM Studio with qwen3-next-80b), it’s all about context engineering. Prompt engineering is still the most important aspect, but as it turns out, the model itself matters far less if you engineer your context well.
