There’s a lot of talk about how MCP isn’t secure, but I think most people don’t realize just how easy it is to trick LLMs. Simon Willison gives a solid overview of the main risks, which he calls the “lethal trifecta”.

The lethal trifecta of capabilities is:

  • Access to your private data — one of the most common purposes of tools in the first place!
  • Exposure to untrusted content — any mechanism by which text (or images) controlled by a malicious attacker could become available to your LLM
  • The ability to externally communicate in a way that could be used to steal your data (I often call this “exfiltration” but I’m not confident that term is widely understood.)
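The key point is that the danger comes from the *combination* of capabilities across an agent's whole tool set, not from any single tool. A minimal sketch (all names and the `Tool` structure are illustrative, not part of MCP):

```python
# Hypothetical sketch: checking whether an agent's combined tool set
# exhibits all three capabilities of the "lethal trifecta".
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    reads_private_data: bool          # capability 1: access to private data
    ingests_untrusted_content: bool   # capability 2: exposure to untrusted content
    can_communicate_externally: bool  # capability 3: a channel for exfiltration

def has_lethal_trifecta(tools: list[Tool]) -> bool:
    # The danger is the union of capabilities across the tool set,
    # not any single tool on its own.
    return (
        any(t.reads_private_data for t in tools)
        and any(t.ingests_untrusted_content for t in tools)
        and any(t.can_communicate_externally for t in tools)
    )

tools = [
    # An inbox is private data *and* a place attackers can put text.
    Tool("read_email", True, True, False),
    Tool("send_http_request", False, True, True),
]
assert has_lethal_trifecta(tools)
```

Note that `read_email` alone already covers two of the three capabilities, which is why "often the same tools in fact" is the scary part.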

The core issue is that LLMs are great at following instructions, but they can't tell legitimate instructions apart from malicious ones.

LLMs follow instructions in content. This is what makes them so useful: we can feed them instructions written in human language and they will follow those instructions and do our bidding.

The problem is that they don’t just follow our instructions. They will happily follow any instructions that make it to the model, whether or not they came from their operator or from some other source.

He digs into MCP specifically:

The problem with Model Context Protocol—MCP—is that it encourages users to mix and match tools from different sources that can do different things.

Many of those tools provide access to your private data. Many more of them—often the same tools in fact—provide access to places that might host malicious instructions.

And yeah, there’s no easy fix.

Here’s the really bad news: we still don’t know how to 100% reliably prevent this from happening.

Plenty of vendors will sell you “guardrail” products that claim to be able to detect and prevent these attacks. I am deeply suspicious of these: If you look closely they’ll almost always carry confident claims that they capture “95% of attacks” or similar… but in web application security 95% is very much a failing grade.
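To see why "95% of attacks" is a failing grade: unlike a human attacker paying a cost per attempt, a prompt-injection attacker can retry endlessly for free. Assuming (for the sake of a back-of-the-envelope sketch) that each attempt independently slips through 5% of the time:

```python
# Back-of-the-envelope: probability that at least one of n independent
# attack attempts gets past a guardrail that blocks 95% of attacks.
P_BLOCKED = 0.95

def p_at_least_one_success(attempts: int) -> float:
    # Complement of "every attempt is blocked".
    return 1 - P_BLOCKED ** attempts

print(round(p_at_least_one_success(1), 3))    # 0.05
print(round(p_at_least_one_success(14), 3))   # 0.512 -- a coin flip by attempt 14
print(round(p_at_least_one_success(100), 3))  # 0.994
```

The independence assumption is generous to the defender; in practice an attacker iterates on whatever feedback leaks through, so the real curve is worse.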