Deep Agents: What LangChain Quietly Built While Everyone Was Arguing About Prompts

Why the simplest agent architecture kept failing, and the four primitives that fix it.


The "Shallow Agent" Problem Nobody Talks About

If you've been following my previous articles on Prompt Engineering and Context Engineering, you know I've been going deeper and deeper into how LLMs actually work under the hood.

This article is the natural next step. Because here's the thing:

We've all seen the basic agent architecture. An LLM running in a loop, calling tools, getting results, calling more tools. It's elegant. It's simple. And for basic tasks, it works great.

But try giving it a complex task, something like "research the top 5 competitors in the Indian EdTech space, compare their pricing models, and draft a strategy doc", and watch it fall apart.

Why? Because the agent becomes shallow.

"Shallow" here means the agent can't plan over longer time horizons. It solves whatever is immediately in front of it rather than thinking about the bigger picture. After 15-20 tool calls, the original objective gets buried under thousands of tokens of intermediate results. The agent starts drifting, repeating work, or just... stopping early because it forgot there were more steps.

Sound familiar? It should. This is exactly the Context Distraction problem I wrote about in my Context Engineering article. The context grows so long that the model over-focuses on recent content and loses sight of the original goal.
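To make the failure mode concrete, here is a minimal sketch of that basic loop. Everything in it is a stand-in I made up for illustration: scripted_model plays the role of the LLM, search is a fake tool, and character counts approximate tokens. It is not how any particular framework implements the loop.

```python
from typing import Callable

# Hypothetical stand-in for an LLM: asks for a tool three times, then answers.
def scripted_model(messages: list[dict]) -> dict:
    tool_results = sum(1 for m in messages if m["role"] == "tool")
    if tool_results < 3:
        return {"tool": "search", "args": {"query": f"competitor {tool_results + 1}"}}
    return {"final": "Done after 3 searches."}

def search(query: str) -> str:
    # Fake tool: real search results would be far longer than this.
    return f"results for {query}: " + "lorem " * 200

TOOLS: dict[str, Callable[..., str]] = {"search": search}

def run_agent(task: str, max_steps: int = 10) -> list[dict]:
    """The basic loop: ask the model, run the requested tool, append, repeat."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = scripted_model(messages)
        if "final" in decision:
            messages.append({"role": "assistant", "content": decision["final"]})
            break
        output = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": output})
    return messages

history = run_agent("Compare three competitors and draft a strategy doc")
goal_chars = len(history[0]["content"])
total_chars = sum(len(m["content"]) for m in history)
print(f"goal is {goal_chars / total_chars:.1%} of the context")
```

Even with only three tool calls, the original goal is already a tiny fraction of the history. Scale that to 15-20 calls with realistic outputs and you can see why the objective gets buried.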

So, what's the fix?


Enter Deep Agents

LangChain noticed something interesting. Applications like Claude Code, Deep Research, and Manus, the agents that actually work on complex, long-running tasks, all share four characteristics:

  1. A planning tool

  2. Sub-agents

  3. Access to a file system

  4. A detailed system prompt

That's it. The core algorithm is the same: an LLM calling tools in a loop. The difference is the infrastructure around the loop.

LangChain packaged these four primitives into an open-source library called deepagents. They call it an "agent harness": not a new framework, not a new reasoning paradigm, just an opinionated wrapper that gives your agent the equipment it needs to go deep instead of staying shallow.

Think of it this way:

LangGraph gives you an engine and a transmission. Deep Agents gives you a car.

Install it with one line:

pip install deepagents

And the simplest possible agent looks like this:

from deepagents import create_deep_agent

def get_weather(city: str) -> str:
    """Get weather for a given city."""
    return f"It's always sunny in {city}!"

agent = create_deep_agent(
    model="openai:gpt-4o",
    tools=[get_weather],
    system_prompt="You are a helpful assistant",
)

agent.invoke(
    {"messages": [{"role": "user", "content": "what is the weather in sf"}]}
)

One function. Under the hood, it handles the LangGraph graph, state management, streaming, and context window management, none of which you had to touch.

But the real magic is in those four primitives. Let's break each one down.


The Four Pillars of Deep Agents

🧠 Pillar 1: The Planning Tool (The Most Counterintuitive One)

Every deep agent automatically gets a write_todos tool.

Now here's the part that blew my mind when I first learned about it:

This tool is a no-op. It literally does nothing.

When the model calls write_todos(["research X", "compare Y and Z", "draft summary"]), no scheduler runs. No task queue gets populated. No database row is written. The tool just accepts the input and returns something like "todos updated."

So why does it work?

Remember the "Attention Budget" concept from my Context Engineering article? LLMs have no hidden scratchpad, no persistent working memory between tool calls. The context window IS the model's memory. Whatever isn't written down effectively doesn't exist for the model on the next step.

When the agent calls write_todos, three things happen:

  1. The plan gets serialized into tokens. The todo list now physically exists as text in the message history.

  2. Future token predictions attend to it. Every subsequent word the model generates is conditioned on that plan being right there in recent context.

  3. The act of writing forced decomposition. To emit the tool call, the model had to commit to a specific breakdown. That commitment is now anchored.

Without this, on a long task the model drifts. After 30 tool calls and 15,000 tokens of intermediate results, the original objective gets buried. The plan tool prevents this by keeping the goal in the "hot zone" of recent context.
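Here is a minimal sketch of what a no-op planning tool could look like. This is my own illustration, not the deepagents implementation: the "- [ ]" checkbox format and the plain-dict message history are assumptions, and the real tool's signature and return value may differ.

```python
def write_todos(todos: list[str]) -> str:
    """Accept a plan and do nothing with it except echo it back."""
    lines = "\n".join(f"- [ ] {item}" for item in todos)
    return f"todos updated:\n{lines}"

messages = [{"role": "user", "content": "Research X, compare Y and Z, draft a summary"}]

# The model "calls" the tool; the runtime appends the call and its result to history.
plan = ["research X", "compare Y and Z", "draft summary"]
messages.append({"role": "assistant", "content": f"write_todos({plan!r})"})
messages.append({"role": "tool", "content": write_todos(plan)})

# No queue, no scheduler: the only side effect is new text in the context.
print(messages[-1]["content"])
```

The only thing that changed is the message history, and that is the whole point: the plan is now text the model will attend to on every later step.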

💡 Harrison Chase (LangChain CEO) specifically credits Claude Code as the inspiration. Claude Code's todo list tool is also a no-op: it's a pure context engineering strategy.

This connects directly to what I discussed in the Context Engineering article: every token in the context affects how the model behaves. The plan tool is deliberately injecting high-value tokens that anchor the model's behavior across dozens of turns.

The tool's functional uselessness is the feature, not a bug. A "real" scheduler would couple the agent to external infrastructure. A no-op just shapes the language model's own behavior, which is the only thing that actually matters.


๐Ÿ“ Pillar 2: The Virtual File System (Context Offloading in Action)

Deep agents get built-in tools: ls, read_file, write_file, edit_file.

Now, if you've read my Context Engineering article, this should immediately ring a bell. This is Context Offloading: the strategy where the agent stores information outside the LLM's context window and pulls it back in when needed.

Instead of stuffing 50,000 tokens of research results into the conversation history (hello, Context Distraction 👋), the agent writes intermediate results to files. When it needs that information later, it reads just the relevant file.

The file system is virtual by default: "files" live in agent state, not on your actual disk. But you can swap backends:

  • In-memory: for quick, ephemeral tasks

  • Local disk: for development

  • LangGraph Store: for cross-thread persistence

  • Sandboxes (Modal, Daytona, Deno): for isolated code execution

When using a sandbox backend, agents also get an execute tool to run shell commands: tests, builds, git operations. That's how the CLI version works as a terminal coding agent comparable to Claude Code.

💡 The key insight: The file system isn't just storage. It's a context management strategy. Write large results to a file, keep a short summary in context, read the file back only when you need the details. This directly combats Context Confusion and Context Distraction.
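The offloading pattern can be sketched with a plain dictionary standing in for the virtual file system. The AgentState class and these simplified tool signatures are my own assumptions for illustration; the library's real tools take different arguments and live behind its backend abstraction.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    messages: list[dict] = field(default_factory=list)
    files: dict[str, str] = field(default_factory=dict)  # virtual files live in state

def write_file(state: AgentState, path: str, content: str) -> str:
    state.files[path] = content
    return f"wrote {len(content)} chars to {path}"

def read_file(state: AgentState, path: str) -> str:
    return state.files.get(path, f"error: {path} not found")

def ls(state: AgentState) -> list[str]:
    return sorted(state.files)

state = AgentState()
raw_results = "pricing row\n" * 5000  # stands in for a large tool result

# Offload the bulk; keep only a receipt and a short note in the context.
receipt = write_file(state, "research/pricing.txt", raw_results)
state.messages.append({"role": "tool", "content": f"{receipt}; summary: 5000 pricing rows"})

context_chars = sum(len(m["content"]) for m in state.messages)
print(ls(state), context_chars)
```

The 60,000 characters of raw data sit in state.files, while the message history only carries a one-line receipt. The details come back through read_file only when the agent asks for them.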


🤖 Pillar 3: Sub-agents (Context Quarantine in Disguise)

A built-in task tool lets the main agent spawn specialized sub-agents. Each sub-agent gets its own clean context window, goes deep on a specific subtask, and returns only a condensed summary.

Again, this is exactly the Context Quarantine strategy from my previous article. Rather than one agent attempting to maintain state across an entire project, specialized sub-agents handle focused tasks with isolated contexts.

Here's how it works mechanically:

  1. The supervisor agent calls task(description="research competitor pricing").

  2. The runtime spins up a sub-agent with a fresh context window.

  3. The sub-agent runs its full tool-calling loop, maybe 20-30 LLM calls and tens of thousands of tokens.

  4. It returns only a condensed summary (1,000–2,000 tokens) to the supervisor.

  5. The supervisor's context stays clean.

The sub-agent might explore extensively, but the supervisor only sees the distilled result. The main context never gets polluted with raw search results, API responses, or intermediate reasoning.
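The mechanics above can be sketched in a few lines. This is a toy under my own assumptions: run_subagent fakes the sub-agent's loop with canned results, and this task function takes the supervisor's message list explicitly, which is not how the real tool is wired.

```python
def run_subagent(description: str) -> str:
    """Hypothetical sub-agent: works in its own message list, returns a summary."""
    sub_messages = [{"role": "user", "content": description}]  # fresh context
    for step in range(25):  # stands in for a long tool-calling loop
        sub_messages.append({"role": "tool", "content": f"raw result {step}: " + "x" * 2000})
    # Only this condensed string leaves the sub-agent.
    return f"Summary of '{description}' based on {len(sub_messages) - 1} tool results."

def task(supervisor_messages: list[dict], description: str) -> None:
    """Supervisor-side tool: delegate, then record only the summary."""
    summary = run_subagent(description)
    supervisor_messages.append({"role": "tool", "content": summary})

supervisor = [{"role": "user", "content": "Draft a competitor strategy doc"}]
task(supervisor, "research competitor pricing")
print(supervisor[-1]["content"])
```

The sub-agent accumulated roughly 50,000 characters of raw results, but the supervisor's history grew by a single short message.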

Inline vs. Async Sub-agents

Now, there's an important nuance here that the latest release (v0.5) addresses.

Inline sub-agents block the supervisor. When the supervisor calls task(), its entire execution loop freezes until the sub-agent finishes. For a sub-agent doing deep research (40 LLM calls, each taking 2-10 seconds, plus tool calls), that's easily 5-15 minutes of wall-clock time where the supervisor can't do anything. It can't respond to the user, can't work on other tasks, can't spawn other sub-agents.

Think of it like a restaurant where the head chef personally goes to the farm to pick vegetables every time an order comes in. The entire kitchen stops.

Async sub-agents fix this. Instead of blocking, start_async_task() returns a task ID immediately. The actual work runs on a separate Agent Protocol server (a different process, possibly a different machine). The supervisor continues its loop, works on other things, and polls for results via check_async_task(task_id) when it's ready.

Same head chef analogy: now the chef calls the farm, places the order, and keeps cooking other dishes. When the delivery arrives, the chef incorporates the ingredients.

💡 Rule of thumb: Inline for sub-second to tens-of-seconds work. Async for minutes-plus. Short, focused tasks (classify this input, extract these fields) should stay inline. Long-running research and multi-step pipelines are where async pays off.
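To show the start-then-poll shape without a real server, here is a local sketch. The names start_async_task and check_async_task come from the text above, but the bodies are my own stand-ins: a thread pool replaces the Agent Protocol server, slow_research is a fake sub-agent, and the returned status dictionaries are an assumed format.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)
_tasks: dict[str, Future] = {}

def slow_research(description: str) -> str:
    time.sleep(0.2)  # stands in for minutes of sub-agent work
    return f"summary for: {description}"

def start_async_task(description: str) -> str:
    """Return a task ID immediately; the work continues in the background."""
    task_id = uuid.uuid4().hex
    _tasks[task_id] = _executor.submit(slow_research, description)
    return task_id

def check_async_task(task_id: str) -> dict:
    """Non-blocking status check."""
    future = _tasks[task_id]
    if not future.done():
        return {"status": "running"}
    return {"status": "done", "result": future.result()}

task_id = start_async_task("research competitor pricing")
first_check = check_async_task(task_id)  # the supervisor is free to do other work here

# Poll until the background work finishes.
while check_async_task(task_id)["status"] != "done":
    time.sleep(0.05)
final = check_async_task(task_id)
print(first_check, final)
_executor.shutdown()
```

The design choice worth noticing is that neither function blocks: starting a task and checking on it both return right away, so the supervisor's loop keeps turning.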


๐Ÿ“ Pillar 4: The Detailed System Prompt

This one might seem obvious, but it's more nuanced than you think.

Claude Code's system prompts are long. Really long. They contain:

  • Detailed instructions on how to use each tool

  • Few-shot examples for specific situations

  • Rules about when to plan vs. when to act

  • Guidelines for verifying work before reporting results

Without these prompts, the agents would not be nearly as deep. Prompting matters still!

Deep Agents ships with opinionated defaults inspired by Claude Code's prompt structure. These teach the model to:

  • Plan before acting

  • Verify work after completing it

  • Manage context proactively (write to files, summarize when needed)

  • Use sub-agents for context isolation

You can extend these with custom instructions or replace them entirely. But the defaults are strong: they encode hard-won lessons about what makes agents actually reliable.


How It All Connects to Context Engineering

If you've been reading my articles in order, you might be seeing a pattern emerging. Let me make it explicit:

| Context Engineering Strategy | Deep Agents Implementation |
| --- | --- |
| Compaction / Summarization | Auto-summarization middleware compacts older messages when context grows long |
| Context Offloading | Virtual file system: write results to files, read back when needed |
| Context Quarantine | Sub-agents with isolated context windows |
| Tool Loadout | Skills system: reusable bundles of workflows and domain knowledge |

Deep Agents is essentially Context Engineering, productized. Every pillar directly addresses one of the context failure modes I wrote about:

  • Context Poisoning → The plan tool lets the agent self-correct by checking todos against actual progress

  • Context Distraction → File system offloads intermediate results so they don't dilute attention

  • Context Confusion → Sub-agents quarantine irrelevant context from the main thread

  • Context Clash → Permission rules and structured tools prevent conflicting information from accumulating

This is why I got so excited about this library. It's not just another framework. It's context engineering principles turned into reusable infrastructure.
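Compaction is the one strategy in that mapping I haven't illustrated yet, so here is a rough sketch of the idea. It is not the library's middleware: the compact function, the character budget, and the bracketed placeholder summary are all assumptions of mine, and a real harness would presumably ask a model to write the summary.

```python
def compact(messages: list[dict], max_chars: int, keep_recent: int = 4) -> list[dict]:
    """Replace older messages with one summary message once history gets too long."""
    total = sum(len(m["content"]) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent + 1:
        return messages
    first, older, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
    # Placeholder summary; a real harness would have a model write this.
    summary = {"role": "system", "content": f"[summary of {len(older)} earlier messages]"}
    return [first, summary, *recent]

history = [{"role": "user", "content": "Write the report"}]
history += [{"role": "tool", "content": f"result {i} " + "x" * 500} for i in range(10)]

compacted = compact(history, max_chars=2000)
print(len(history), "->", len(compacted))
```

Note that the sketch always keeps the first message. The original goal is exactly the thing you don't want compacted away.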


When to Use What: The LangChain Stack

LangChain now has three tiers. Choosing correctly matters:

LangChain (create_agent): For simple agents and standardized team patterns. The tool-calling loop is enough. Think: a customer service bot that looks up order status.

LangGraph: The low-level runtime. For when you need full control over state, conditional edges, and custom graph topology. Think: a complex approval workflow with branching logic you need to define precisely.

Deep Agents: For complex, non-deterministic, long-running tasks where you want planning, filesystem, sub-agents, and context compaction out of the box. Think: a research agent that explores a topic for 30 minutes and produces a comprehensive report.

For simple Q&A or single-tool tasks, a basic agent is fine. Deep Agents shine when the task feels more like a project than a question.


Quick Start: Building Your First Deep Agent

Here's a minimal but real example using Tavily for web search:

import os
from langchain.chat_models import init_chat_model
from langchain_core.tools import tool
from deepagents import create_deep_agent
from tavily import TavilyClient

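# Placeholder keys for illustration; in practice, set these in your shell or a secrets manager.
os.environ["OPENAI_API_KEY"] = "your-key"
os.environ["TAVILY_API_KEY"] = "your-key"

# With no key passed, TavilyClient reads TAVILY_API_KEY from the environment.
tavily = TavilyClient()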

@tool
def web_search(query: str) -> str:
    """Search the web for current information."""
    results = tavily.search(query, max_results=3)
    return "\n".join([r["content"] for r in results["results"]])

model = init_chat_model("openai:gpt-4o")

agent = create_deep_agent(
    model=model,
    tools=[web_search],
    system_prompt="You are a research assistant. Always plan before acting.",
)

result = agent.invoke({
    "messages": [
        {"role": "user", "content": "Research the current state of AI agents in 2025 and write a summary"}
    ]
})

The agent will automatically:

  1. Create a plan using write_todos

  2. Search the web using your tool

  3. Write intermediate findings to its virtual file system

  4. Synthesize everything into a final summary

  5. Check off todos as it goes

All of that behavior comes from the harness; you didn't code any of it.
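Once invoke returns, you'll want to pull out the final answer and see what the agent wrote to its files. Here is a small sketch of two helpers for that. I'm assuming the result is a dictionary with a "messages" list and a "files" mapping; treat both key names, and the helper names final_text and list_files, as assumptions to check against the version you install. The example runs on a hand-built stand-in rather than a real agent result.

```python
from typing import Any

def final_text(result: dict[str, Any]) -> str:
    """Return the content of the last message, for dict or object messages."""
    last = result["messages"][-1]
    return last["content"] if isinstance(last, dict) else last.content

def list_files(result: dict[str, Any]) -> list[str]:
    """Return virtual file paths, assuming they are stored under a 'files' key."""
    return sorted(result.get("files", {}))

# Hand-built stand-in for what agent.invoke(...) might return.
fake_result = {
    "messages": [
        {"role": "user", "content": "Research AI agents"},
        {"role": "assistant", "content": "Here is the summary."},
    ],
    "files": {"notes.md": "raw findings"},
}
print(final_text(fake_result))
print(list_files(fake_result))
```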


Things I Wish Someone Told Me Earlier

After spending considerable time understanding this library, here are my honest takeaways:

1. The plan tool is psychological, not functional. Don't expect write_todos to do scheduling. It shapes the model's behavior by making planning explicit in context. That's it. And that's enough.

2. Sub-agents aren't free. Each spawn is another full LLM call stack. Use them for genuine context isolation, not just to make your architecture look fancy.

3. Model choice changes behavior significantly. Some models plan well but execute tool calls unreliably. Some are the opposite. Benchmark before committing. Deep Agents is provider-agnostic, so try GPT-4o, Claude, Qwen, and Llama, and compare.

4. The file system is virtual by default. "Files" live in agent state unless you configure a durable backend. Don't assume writes persist across threads without explicit setup.

5. Prompting still matters. A lot. The defaults are strong but opinionated toward coding/research work. Domain-specific agents need prompt customization.


Managing context is often the toughest part of creating an agent. Deep Agents doesn't eliminate that challenge, but it gives you principled tools to handle it, so you can focus on building what actually matters.


