MCP Performance Optimization: Reduce Latency & Token Usage
Caching, streaming, tool selection, and context pruning techniques that cut MCP latency by up to 70%. Real benchmark numbers and production-ready code.
TL;DR
- Cache tool results aggressively — repeated reads are the biggest latency killer
- Use streaming responses (SSE transport) for tools that return large payloads
- Keep tool schemas lean: verbose descriptions cost tokens every request
- Batch multiple reads into one tool call instead of chaining N sequential calls
- TypeScript MCP servers cold-start ~80ms faster than Python equivalents
- Prune context between turns — only pass what the model needs for the current step
Why MCP Performance Matters
Every MCP tool call adds latency to the user experience. In a typical Claude Desktop session, the model may invoke 5–15 tools per task. If each call takes 300ms, that is 1.5–4.5 seconds of waiting before the model can reason about the results. Multiply that across thousands of users and the cost compounds fast — both in wall-clock time and in API token spend.
The good news: most performance problems in MCP integrations come from a handful of preventable patterns. This guide walks through each one with benchmarks and fixes.
Benchmark Baseline
All numbers below were measured on a MacBook Pro M3 (local stdio transport) and an AWS t3.medium (SSE transport, us-east-1), running Claude Sonnet (claude-sonnet-4-6) via the Anthropic API.
| Scenario | Before | After | Improvement |
|---|---|---|---|
| File read (no cache) | 320ms | 8ms | 97% faster |
| DB query (no cache) | 480ms | 12ms | 97% faster |
| 5 sequential reads | 1600ms | 490ms | 69% faster |
| Large payload (no stream) | 2200ms | 820ms TTFB | 63% faster TTFB |
| Python cold start | 310ms | — | baseline |
| TypeScript cold start | 230ms | — | 26% faster |
1. Caching Tool Results
The single biggest performance win in most MCP deployments is caching. When an AI session reads the same file, queries the same database row, or fetches the same API endpoint multiple times within a conversation, you are paying the full round-trip cost every time. A simple in-memory cache with TTL cuts that to near zero.
TypeScript — LRU cache for MCP tools
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { LRUCache } from "lru-cache";
import { z } from "zod";

const server = new McpServer({ name: "cached-server", version: "1.0.0" });

// TTL cache: 5 minutes, max 500 entries
const cache = new LRUCache<string, string>({
  max: 500,
  ttl: 1000 * 60 * 5,
});

server.tool(
  "read_document",
  { id: z.string().describe("Document ID") },
  async ({ id }) => {
    const cacheKey = `doc:${id}`;
    const cached = cache.get(cacheKey);
    if (cached) {
      return { content: [{ type: "text", text: cached }] };
    }
    // Expensive DB fetch only on a cache miss (db is your database client)
    const doc = await db.documents.findById(id);
    const text = JSON.stringify(doc);
    cache.set(cacheKey, text);
    return { content: [{ type: "text", text }] };
  }
);
```

For mutable data, key your cache entries with a content hash or entity version number rather than a fixed TTL. That way, a write to the database immediately invalidates the relevant cache entry.
2. Streaming Responses with SSE Transport
The default stdio transport buffers the entire tool response before returning it to the model. For tools that return large payloads — log files, long documents, search results — this means the model sits idle until the full response is assembled. Switch to SSE (Server-Sent Events) transport and stream the content progressively.
TypeScript — SSE transport setup
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { SSEServerTransport } from "@modelcontextprotocol/sdk/server/sse.js";
import express from "express";

const app = express();
const server = new McpServer({ name: "streaming-server", version: "1.0.0" });

// Single-client example: keep a reference so the POST handler can reach the
// transport created in the GET handler. For multiple clients, store
// transports in a map keyed by session ID.
let transport: SSEServerTransport;

app.get("/sse", async (req, res) => {
  // The transport writes the SSE headers (text/event-stream, keep-alive) itself
  transport = new SSEServerTransport("/messages", res);
  await server.connect(transport);
});

// No express.json() here — the transport parses the raw request body itself
app.post("/messages", async (req, res) => {
  await transport.handlePostMessage(req, res);
});

app.listen(3000);
```

3. Tool Schema Optimization
Every MCP session sends the full tools/list payload to the model before each turn. Verbose descriptions, redundant parameter explanations, and deeply nested input schemas all inflate your token count — and therefore your inference cost — on every single request.
| Pattern | Tokens / Tool | Recommendation |
|---|---|---|
| Verbose description (200+ words) | ~300 | Trim to 1–2 sentences |
| Nested object params (3+ levels) | ~180 | Flatten to scalar params |
| Enum with 20+ values | ~120 | Use string + validate server-side |
| Concise description (1–2 sentences) | ~40 | Target this range |
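The gap is easy to see by comparing two declarations of the same parameter. The schemas below are illustrative (the tool and its description are made up), and the characters-divided-by-four token estimate is only a rough heuristic:

```typescript
// Two JSON Schema declarations for the same tool input. Both accept the
// same data; the lean one ships far fewer tokens on every request.
const verboseSchema = {
  type: "object",
  properties: {
    query: {
      type: "string",
      description:
        "The search query string that the user wants to search for. This " +
        "should be a natural-language description of the documents the " +
        "user is interested in retrieving. The server will tokenize the " +
        "query, expand synonyms, and rank candidate documents by relevance.",
    },
  },
  required: ["query"],
};

const leanSchema = {
  type: "object",
  properties: {
    query: { type: "string", description: "Search query" },
  },
  required: ["query"],
};

// Rough token estimate: ~4 characters per token
const approxTokens = (schema: object) =>
  Math.ceil(JSON.stringify(schema).length / 4);
```

Running approxTokens over both shows a several-fold difference, and that cost recurs on every turn of every session that lists the tool.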
4. Batching Sequential Calls
A common anti-pattern is exposing fine-grained tools that force the model to make N sequential calls to accomplish what could be one batched call. If you have a filesystem server, for instance, add a read_multiple_files tool alongside read_file. The model will use it.
TypeScript — batch read tool
```typescript
import { promises as fs } from "node:fs";
import { z } from "zod";

server.tool(
  "read_multiple_files",
  {
    paths: z.array(z.string()).max(20).describe("File paths to read (max 20)"),
  },
  async ({ paths }) => {
    // Read all files in parallel — not sequentially
    const results = await Promise.all(
      paths.map(async (p) => {
        try {
          const content = await fs.readFile(p, "utf8");
          return `=== ${p} ===\n${content}`;
        } catch (err) {
          return `=== ${p} === ERROR: ${(err as Error).message}`;
        }
      })
    );
    return {
      content: [{ type: "text", text: results.join("\n\n") }],
    };
  }
);
```

5. Context Pruning
MCP tool results accumulate in the conversation context. After a tool returns a 200-line JSON blob, that entire blob is re-sent to the model on every subsequent turn. Design your tools to return only what the model needs for the next reasoning step, not the full raw API response.
- Return summaries when possible: instead of a 500-line log file, return the last 20 error lines
- Filter API responses server-side before returning to the model
- Use pagination: expose a page param and return 10–20 items at a time
- Strip metadata fields the model does not need (internal IDs, audit timestamps, etc.)
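As a concrete sketch of the first point, a log tool can filter server-side before anything reaches the model. Here pruneLog and the ERROR marker convention are hypothetical; adapt the filter to your log format:

```typescript
// Return only the most recent error lines instead of the raw log.
// Everything filtered out here never occupies the model's context.
function pruneLog(raw: string, maxLines = 20): string {
  const errors = raw.split("\n").filter((line) => line.includes("ERROR"));
  const tail = errors.slice(-maxLines);
  return `${tail.length} of ${errors.length} error lines shown:\n${tail.join("\n")}`;
}
```

A 500-line log collapses to roughly 20 lines plus a one-line summary, so every subsequent turn re-sends a fraction of the tokens.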
6. Python vs TypeScript Performance
Both the official Python SDK (mcp) and TypeScript SDK (@modelcontextprotocol/sdk) are production-ready. The performance differences are real but smaller than most engineers expect:
- Cold start: TypeScript (Node.js) starts ~80ms faster than Python. For stdio servers that Claude Desktop restarts per session, this adds up.
- Throughput: For I/O-bound tools (HTTP calls, DB queries), the difference is negligible — both spend most time waiting on the network.
- Memory: Python uses ~15MB more RSS at idle due to the interpreter overhead.
- CPU-bound tools: TypeScript has the edge for pure computation; Python wins if you need NumPy, Pandas, or ML libraries.
The practical recommendation: choose the language your team knows best. The performance gap rarely justifies a rewrite.
7. Connection Keep-Alive for SSE
When using SSE transport, avoid tearing down and re-establishing the connection between turns. Keep-alive connections eliminate TCP handshake overhead (~50–120ms per request depending on geography). Set a heartbeat ping to prevent proxies and load balancers from closing idle connections.
SSE heartbeat (Node.js)
```typescript
// Inside the GET /sse handler: send a comment-line ping every 30s so proxies
// and load balancers do not close the idle connection
const heartbeat = setInterval(() => {
  res.write(": ping\n\n");
}, 30_000);

// Stop pinging when the client disconnects
res.on("close", () => clearInterval(heartbeat));
```

Quick-Win Checklist
- Cache repeated reads with a TTL or version-keyed cache
- Stream large payloads over SSE instead of buffering the full response
- Trim tool descriptions to 1–2 sentences and flatten nested schemas
- Batch related reads into one tool call, using Promise.all() for parallel I/O inside a single tool call
- Prune tool results before they enter the model's context
- Send a heartbeat ping to keep SSE connections alive