🧠 阿头学 · 💬 讨论题

生产级智能体的关键不是模型,而是运行时

这篇文章判断准确地抓住了“智能体上线后真正难的是 runtime 而不是 demo”,但它同时是一篇明显把 LangSmith 方案包装成行业标准答案的技术营销文。

2026-04-21 原文链接 ↗

核心观点

  • runtime 才是生产分水岭 文章最站得住的判断是,智能体一旦进入真实环境,就不再是“提示词+工具”的问题,而是持久执行、暂停恢复、并发处理、权限隔离、可观测性这些运行时问题;如果没有这些能力,所谓 agent 很快就会退化成脆弱的聊天脚本。
  • 持久执行是底座,不是附加项 作者对 checkpoint、thread_id、interrupt/resume 的强调是对的,因为长任务、人工审批、跨天等待、worker 崩溃恢复,本质上都要求系统能跨进程保存状态;这一点不是体验优化,而是生产可用性的硬门槛。
  • guardrails 必须写进执行层 文中最有普适价值的判断是,PII 脱敏、工具调用上限、重试、fallback 不能靠提示词“提醒模型遵守”,而要靠 middleware 在 runtime 内部确定性执行;这条原则不依赖 LangSmith,放到任何 agent 框架都成立。
  • 实时交互问题被讲清楚了 对 streaming、double-texting 的拆解有很强工程价值,尤其是 enqueue / reject / interrupt / rollback 四种策略,把一个常被忽视的聊天产品问题说成了明确的系统设计选择,而不是交给前端临场补锅。
  • “开放、无锁定”说得过头了 文章一边强调开源 harness、开放协议、Postgres 可控,一边又把 tracing、deployment、auth、studio、evals、sandbox 等核心能力深绑到 LangSmith 体系里;这不是没有锁定,而是把锁定从模型层转移到了基础设施层。

跟我们的关联

  • 对 ATou 意味着什么、下一步怎么用 这篇文章最适合 ATou 拿来重估“什么算真正的 agent 产品”:不要再只看模型和工具数量,而要优先审视持久执行、HITL、并发消息处理和权限控制;下一步可以把现有 agent 项目按“执行层/状态层/控制层/改进层”做一次缺口盘点。
  • 对 Neta 意味着什么、下一步怎么用 对 Neta 来说,这篇文提供了一个判断 agent 基础设施投资价值的框架:真正有壁垒的可能不是 prompt 层,而是 runtime 像操作系统一样的能力栈;下一步应重点比较各家方案的迁移成本、运维复杂度和平台锁定程度,而不是只看 demo 效果。
  • 对 Uota 意味着什么、下一步怎么用 对 Uota 来说,文中关于“沙箱保护宿主机,不保护沙箱内部”的判断非常关键,说明安全设计不能停留在“我有 sandbox”这种表层安心;下一步应该把凭证代理、最小权限、恶意输入下的动作约束作为安全评审的必查项。
  • 对三者共同意味着什么、下一步怎么用 这篇文章可以直接作为讨论智能体产品成熟度的共用清单,但不能把它当成 LangSmith 的宣传册照单全收;下一步最好把“哪些能力是必需品,哪些只是这家产品的实现偏好”明确拆开。

讨论引子

1. 生产级 agent 的核心壁垒,究竟会沉淀在模型能力上,还是 runtime 这类“agent OS”能力上?
2. 如果一家平台同时提供 deployment、tracing、memory、sandbox、evals,它是在减少工程负担,还是在加深基础设施锁定?
3. 对不同场景来说,double-texting 应该默认 enqueue、interrupt 还是 rollback,背后对应的产品哲学分别是什么?

要构建一个好的智能体,你需要一个好的 harness。要部署这个智能体,你还需要一个好的 runtime。

harness 是你围绕模型搭建的整套系统,用来帮助智能体在其领域中取得成功。其中包括提示词、工具、技能,以及任何支撑模型与工具调用循环的其他要素,而这个循环正是智能体的定义。runtime 则是更底层的一切,包括持久执行、记忆、多租户、可观测性,以及那套让智能体能在生产环境持续运行、而无需你的团队从头重复造轮子的机制。

本指南会带你了解,智能体一旦部署后会浮现出哪些生产环境需求,哪些运行时能力可以满足这些需求,以及 deepagents deploy 如何把这些能力打包成真正可以交付上线的东西。

生产级智能体所需的运行时能力

在本节中,runtime 指的是 LangSmith Deployment(LSD)及其 Agent Server。LSD 负责让智能体在生产环境中运行,而 Agent Server 则是 assistants、threads、runs、memory 和 scheduled jobs 的接口。下表将每一项生产需求与能够满足它的运行时原语对应起来。


持久执行

智能体的工作方式,是运行一个循环。给定一个提示,模型会进行推理、调用工具、观察结果,然后不断重复,直到它判断任务已经完成。


不同于通常会在毫秒内返回的 Web 请求,这个循环可能持续数分钟甚至数小时。一次运行可能发起几十次模型调用,生成子智能体,或者无限期地等待人工批准一份草稿。在这个循环中的任何位置发生崩溃、部署变更或瞬时故障,都不该抹掉之前已经完成的工作。

在实际中,这种需求会集中体现在两个地方。

长时间运行必须能扛住基础设施故障。 一个研究型智能体如果花了二十分钟收集资料、综合结论,那么一旦 worker 进程挂掉,它就不能从头再来,因为 token 已经花掉了,工具调用也已经执行过了。理想情况是,它能从最后一个完成的步骤继续恢复,并保留此前的全部状态。

智能体必须能够真正停下来等待。 一个在等待人工批准交易的智能体,并不知道人会在三十秒后回复,还是三天后回复。在这整个等待窗口里一直占着 worker 进程或客户端连接,是不可行的。智能体需要真正停住,释放资源、交还 worker,然后在之后精确地从原地继续。

这两个需求,本质上都由同一件事解决,那就是持久执行。

  • 智能体运行在受管任务队列上,并带有自动 checkpoint,因此任何一次运行都可以从中断点精确地重试、重放或恢复。

  • 图执行的每个 super-step 都会向持久层写入一个 checkpoint,默认写入 PostgreSQL,并以 thread_id 作为键。这个 thread_id 就像指向该次运行的持久游标。

  • 当某个 worker 崩溃时,这次运行的 lease 会被释放,并由另一个 worker 从最新 checkpoint 接手。

  • 当智能体等待人工输入时,进程会交出自己的执行槽位,而这次运行则会无限期休眠,直到被恢复。

  • 可配置的重试策略可以按节点控制退避方式、最大尝试次数,以及哪些异常会触发重试。

持久性是本列表其余能力赖以成立的基础。正因为执行可以跨进程边界暂停和恢复,智能体才能无限期等待人工输入、在后台运行、在运行中途经历部署变更而不丢状态,并在并发输入下不破坏状态。

记忆

智能体需要两种不同的记忆,而且这一区分非常重要。

短期记忆 是智能体在单次对话 内部 累积起来的内容。包括来回交换的消息、已经发起的工具调用、以及一次运行过程中逐步构建的中间状态。这些内容存在该线程的 checkpoint 里,作用域限定在某个 thread_id 上,并且在概念上会随着这段对话结束而结束。同一线程上的后续消息,能够看到此前在这个线程里发生过的一切。

长期记忆 是智能体在 跨对话 场景中持续携带的内容。它可能包括跨多次对话学到的用户偏好、项目约定与最佳实践,或者随着每次新查询不断增强的知识库。这些内容都不属于某一个单独的线程,而是用户级或组织级上下文,应该在智能体参与的每一段对话之间持续存在。仅靠 checkpoint 做不到这一点,因为 checkpoint 的状态只绑定单个线程。

长期记忆正是 Agent Server 内置 store 的用途所在。它是一个键值接口,记忆按照 namespace tuple 组织,例如 (user_id, "memories"),并能够跨线程持久保存。你的智能体可以在一次对话中写入 store,在下一次对话中再读出来。它默认由 PostgreSQL 支撑,并通过 embedding 配置支持语义搜索,因此智能体可以按语义而非精确匹配来检索记忆。如果你需要不同的存储特性,也可以替换为自定义后端。namespace 结构本身也很灵活,可以按用户、assistant、组织,或任何适合你数据模型的组合进行划分。

因为那些经过数月积累下来的记忆,往往是系统产出的最有价值的数据之一,所以它存放在哪里,非常重要。这个 store 可以直接通过 API 查询;如果你是自托管,它就存在你自己的 PostgreSQL 实例里。把这些数据保存在你能控制的标准格式中,才能让你在不同模型之间迁移、对其进行分析,或在智能体之外继续利用它。
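
store 的接口形态可以用下面的独立草图来示意:记忆按 namespace tuple 组织,跨线程可读。类名与朴素的子串检索都是演示假设;真实的 Agent Server store 默认由 PostgreSQL 支撑,并用 embedding 做语义检索。

```python
import json

# 概念示意:按 namespace tuple 组织、跨线程持久的键值记忆 store。
class MemoryStore:
    def __init__(self):
        self._data: dict[tuple, dict[str, dict]] = {}

    def put(self, namespace: tuple, key: str, value: dict) -> None:
        # namespace 例如 (user_id, "memories"),不绑定任何单个线程
        self._data.setdefault(namespace, {})[key] = value

    def get(self, namespace: tuple, key: str):
        return self._data.get(namespace, {}).get(key)

    def search(self, namespace: tuple, query: str) -> list[dict]:
        # 真实实现用 embedding 做语义搜索;这里用朴素子串匹配代替
        return [
            v for v in self._data.get(namespace, {}).values()
            if query in json.dumps(v, ensure_ascii=False)
        ]
```

同一个 (user_id, "memories") 命名空间在任何线程里都能读到,这正是它与绑定单线程的 checkpoint 状态的区别。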

多租户

智能体一旦开始为不止一个用户提供服务,就会立刻冒出一系列在单人模式下根本不存在的问题。这些问题可以拆成三个不同层面,而 Agent Server 为每一层都提供了独立的原语。

将一个用户的数据与另一个用户隔离开。 用户 A 的运行,只能访问用户 A 的线程,也只能读取用户 A 的记忆。自定义认证会作为 middleware 运行在每一个请求上。你的 @auth.authenticate handler 会校验传入凭证,并返回用户的身份与权限,这些信息随后会附加到运行上下文里。注册在 @auth.on.threads、@auth.on.assistants.create 等位置的授权处理器,则会在资源创建时打上归属元数据,并在读取时返回过滤字典,从而决定谁可以查看或修改哪些内容。handler 的匹配顺序从最具体到最通用,因此你可以先从一个全局 handler 起步,再随着模型演进逐步加入更细分的资源级 handler。
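
“创建时打归属元数据、读取时按过滤字典筛选”这一思路,可以用下面的独立草图示意。函数名是演示假设,与真实 SDK 的 @auth.on.* 装饰器只是概念对应:

```python
# 概念示意:资源创建时标记 owner,读取时按过滤条件筛选。
THREADS: list[dict] = []


def on_thread_create(identity: str, thread: dict) -> dict:
    # 创建时打上归属元数据
    thread["metadata"] = {**thread.get("metadata", {}), "owner": identity}
    THREADS.append(thread)
    return thread


def on_thread_read(identity: str) -> list[dict]:
    # 授权 handler 返回的过滤字典:只允许看到 owner == 当前用户 的线程
    filters = {"owner": identity}
    return [
        t for t in THREADS
        if all(t["metadata"].get(k) == v for k, v in filters.items())
    ]
```

用户 B 的读取请求永远过滤不出用户 A 的线程,隔离由执行层保证,而不依赖调用方自觉。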

让智能体代表用户行动。 智能体经常需要使用用户自己的凭证去调用第三方服务,比如读取 他们的 日历、向 他们的 Slack 发消息,或者在 他们的 仓库里开一个 PR。Agent Auth 负责处理这一模式中的 OAuth 流程和 token 存储,因此在运行时,智能体可以拿到用户作用域下的凭证,而无需你自己管理 refresh 流程。用户只需认证一次,智能体就可以在之后的多次运行中持续代表他们行动。

控制谁能够操作系统本身。 除了终端用户访问之外,还存在另一个问题,那就是 你自己团队 里的哪些成员可以部署智能体、配置它们、查看 trace,或修改认证策略。RBAC 处理的就是这一层面的操作者访问控制。

这三层可以自然组合起来。终端用户通过你的 auth handler 完成认证,智能体通过 Agent Auth 调用第三方服务,而你的团队则在 RBAC 策略之下操作整个部署系统。


Human-in-the-loop(HITL)

智能体的工作方式,是运行一个循环。给定一个提示,模型会推理并决定调用工具,观察结果,然后重复,直到它判断自己已经完成当前任务。多数时候,你希望这个循环不中断地跑下去,这正是它产生价值的地方。但有时,你需要在这个循环的关键决策点,把人放进中间。

常见有两种情况。

  1. 审查一项拟执行的工具调用。 在智能体执行某个后果重大的动作之前,比如发送邮件、执行金融交易、删除文件,你希望有人能准确看到它即将做什么,并决定如何回应。拿发送邮件来说,智能体先起草消息,然后在真正发出前暂停。你可以原样批准,也可以在发出前修改主题或正文,或者附上理由与具体修改要求后拒绝,让智能体重新修改再试一次。

  2. 智能体主动提出澄清问题。 有时智能体会走到一个它无法自行解决的决策点,这不是因为缺工具,而是因为正确答案依赖人的判断或偏好。与其猜测,它可以直接把问题抛出来,比如“我找到了三个符合该模式的配置文件,应该修改哪一个?”或者“这次部署应该发到 staging 还是 production?”你的回答会成为这次 interrupt 的返回值,而智能体会从它停下来的地方继续往下走。

Agent Server 用两个原语来处理这件事。interrupt() 用来暂停执行,并把一个 payload 暴露给调用方;Command(resume=...) 则带着人的回应继续执行。它们配合起来,可以构建审批闸门、草稿审阅循环、输入校验,以及任何需要人在执行中途参与判断的工作流。


在底层,interrupt() 会触发运行时的 checkpointer,把完整的图状态写入持久存储,并以 thread_id 作为持久游标键。随后进程释放资源并无限期等待。不同于那些会在特定节点前后暂停的静态断点,interrupt() 是动态的。你可以把它放在代码里的任何位置,可以放在条件分支中,也可以直接嵌进某个工具函数里,让审批逻辑跟着工具一起走。当 Command(resume=...) 在几分钟、几小时甚至几天后到达时,这个 resume 值就会成为 interrupt() 调用的返回值,而执行会从停下来的原地继续。由于 resume 可以接受任何可 JSON 序列化的值,因此响应不局限于批准或拒绝。审阅者可以返回一份修改过的草稿,人可以补充缺失的上下文,下游系统也可以注入计算结果。当并行分支中各自调用 interrupt() 时,所有待处理的 interrupt 会一起暴露出来,并且可以在一次调用中统一恢复,也可以随着回应陆续回来逐个恢复。
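interrupt 的控制流语义,可以借 Python 生成器做一个概念示意:yield 相当于 interrupt(),把 payload 暴露给调用方并交出执行权;send() 相当于 Command(resume=...),其参数成为“interrupt 调用”的返回值。以下是独立草图,并非真实 API:

```python
# 概念示意:用生成器模拟 interrupt()/resume 的暂停与恢复。
def email_agent(draft: str):
    # yield 处相当于 interrupt():暴露 payload,然后停在原地等待
    decision = yield {"action": "send_email", "draft": draft}
    if decision.get("approve"):
        # 审阅者可以返回修改过的草稿,而不只是批准/拒绝
        return f"sent: {decision.get('draft', draft)}"
    return "cancelled"


def start(agent):
    # 运行到第一个 interrupt,把 payload 交给调用方(例如审批 UI)
    return agent.send(None)


def resume(agent, value):
    # 相当于 Command(resume=value):注入人工回应,从原地继续
    try:
        agent.send(value)
    except StopIteration as done:
        return done.value
```

真实 runtime 与这个草图的关键差别在于:interrupt 之间的状态会写入持久存储,因此 resume 可以发生在几天之后、另一个进程里。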

实时交互

Human-in-the-loop 是一种交互模式,它允许执行暂停下来,等待人来审阅或提供输入,有时是立刻,有时则会晚很多。除此之外,还有另一类实时问题,会在用户在线、而智能体正在主动工作时出现,也就是让进度可见的 streaming,以及协调并发消息的 double-texting。

Streaming

如果一个智能体要花三十秒才能给出回答,用户这三十秒里只能盯着一个转圈,不知道它到底是在推进、卡住了,还是马上就要失败。而且在完整回答结束之前,用户也无法开始阅读。streaming 同时解决了这两个问题。智能体在生成内容时,部分输出会持续流向客户端,因此用户能实时看到回答逐步成形。

Streaming API 支持多种模式,取决于你希望看到多细的粒度。你可以在每个图步骤后获取完整状态快照,也可以只获取状态更新,只获取逐 token 的 LLM 输出,或者获取自定义应用事件。它们也可以组合使用。运行级 streaming,也就是 client.runs.stream(),作用域是单次 run;线程级 streaming,也就是 client.threads.joinStream(),则会打开一个长连接,持续接收某个线程上所有 run 产生的事件。当后续消息、后台运行或 HITL 恢复都在同一个线程上触发活动时,这种模式就很有用。

线程级 streaming 支持通过 Last-Event-ID header 恢复。客户端会携带它收到的最后一个事件 ID 重新连接,服务器则会从那个位置继续回放,不会留下空档。如果没有这个机制,每次连接中断都会导致客户端要么漏掉输出,要么只能从头开始。
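无空档续传的思路可以用下面的独立草图示意:服务器为每个事件递增编号,客户端带着最后收到的 ID 重连,服务器从其后回放。数据结构为演示假设:

```python
# 概念示意:按事件 ID 续传,重连后不重复也不遗漏。
EVENTS: list[tuple[int, str]] = []  # (event_id, data),生产中按线程持久化


def publish(data: str) -> int:
    event_id = len(EVENTS) + 1
    EVENTS.append((event_id, data))
    return event_id


def replay_after(last_event_id: int) -> list[tuple[int, str]]:
    # 客户端带 Last-Event-ID 重连,服务器从那个位置之后继续回放
    return [(i, d) for i, d in EVENTS if i > last_event_id]
```

这与 SSE 的 Last-Event-ID header 是同一语义:续传的前提是服务器保留(或能重建)事件序列。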

Double-texting

第二类实时问题是,智能体还在处理上一条消息时,用户又发来了一条新消息。这在聊天 UI 中几乎时时都在发生。有人先发出一个问题,随后意识到自己想说的略有不同,于是在第一轮运行结束前又补发了一条修正。我们把这叫作 double-texting,而运行时必须明确决定该怎么处理。

这里有四种策略,具体哪种合适,取决于你的应用。

  • enqueue(默认):新输入等待当前运行结束后,再按顺序处理。

  • reject:在当前运行完成之前,拒绝所有新输入。

  • interrupt:中止当前运行,保留已有进展,并基于该状态处理新输入。这适合第二条消息建立在第一条消息之上的情况。

  • rollback:中止当前运行,撤销所有进展,包括原始输入,然后把新消息当作一次全新运行来处理。这适合第二条消息直接替代第一条消息的情况。


interrupt 能带来最灵敏的聊天体验,但它要求你的图能干净地处理部分完成的工具调用,也就是说,某个工具调用在中断发生时已经启动但尚未完成,恢复时可能需要清理。enqueue 则是最稳妥的默认值,不会破坏状态,代价是用户必须等待。
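四种策略对“新输入”的不同处置,可以用下面的独立草图表达语义差异。字段与函数名均为演示假设,不对应真实 API:

```python
# 概念示意:double-texting 的四种策略。
def handle_new_message(strategy: str, current: dict, new_input: str) -> dict:
    if strategy == "enqueue":    # 排队:当前运行不受影响,新输入顺序处理
        current["queue"].append(new_input)
        return current
    if strategy == "reject":     # 拒绝:当前运行完成前不接受新输入
        raise RuntimeError("run in progress")
    if strategy == "interrupt":  # 打断:保留已有进展,在其上处理新输入
        current["inputs"].append(new_input)
        current["status"] = "interrupted"
        return current
    if strategy == "rollback":   # 回滚:连同原始输入一起丢弃,重新开始
        return {"inputs": [new_input], "progress": [], "queue": [],
                "status": "restarted"}
    raise ValueError(strategy)
```

注意 interrupt 与 rollback 的差别正是正文所说的“第二条消息建立在第一条之上”还是“直接替代第一条”。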

防护栏

并不是所有生产环境问题,都能简单归结为“让循环持久运行”。有些问题必须直接塑造循环本身,比如拦截模型输入、过滤工具输出、为高成本操作设置上限。这些策略应该写在代码里,而不是塞进提示词里。它们必须每一次都执行,而不是靠模型偶尔记得去遵守。

两个例子能把这件事讲得很具体。

  1. 在模型看到敏感数据之前就先做脱敏。 一个客服智能体会处理包含个人敏感信息的用户消息,比如姓名、邮箱、账号号码。你不希望模型看到这些内容,也不希望它们出现在 trace 里,而且合规要求通常也会要求在日志记录前先完成脱敏。这必须在每一次模型调用前,以确定性的方式发生。

  2. 为高成本操作设置硬上限。 一个能够调用付费外部 API 的智能体,必须对每次运行中的调用次数设定一个硬性上限,否则模型一旦困惑起来,很可能会愉快地调用五十次,在中午之前就把你的预算烧光。

这两类问题都由 middleware 处理。middleware 会在既定 hook 上包裹整个智能体循环,比如 before_model、wrap_model_call、wrap_tool_call、after_model,从而保证这些策略会在每一个相关步骤周围被确定性地执行。
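“确定性包裹每次调用”的做法,可以用下面的独立草图示意:函数名借用正文提到的 wrap_model_call / wrap_tool_call hook 形态,但脱敏规则、上限数值与整体组织都是为演示而极度简化的假设,并非任何内置 middleware 的实现。

```python
import re

# 概念示意:写在执行层、每次都生效的两类 guardrail。
def redact_pii(text: str) -> str:
    # 每次模型调用前确定性脱敏,不依赖提示词"提醒模型遵守"
    text = re.sub(r"[A-Za-z0-9.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    return re.sub(r"\b\d{12,19}\b", "[ACCOUNT]", text)


def wrap_model_call(model_call):
    def wrapped(prompt: str):
        return model_call(redact_pii(prompt))  # 模型永远看不到原始 PII
    return wrapped


def wrap_tool_call(tool_call, max_calls: int = 10):
    count = {"n": 0}

    def wrapped(*args):
        if count["n"] >= max_calls:            # 高成本操作的硬上限
            raise RuntimeError("tool call limit exceeded")
        count["n"] += 1
        return tool_call(*args)
    return wrapped
```

无论模型当时在做什么,被包裹的调用都会先经过这两层;这正是“写进执行层”与“写进提示词”的区别。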


LangChain 已经内置了覆盖常见需求的 middleware,比如 PIIRedactionMiddleware、ModelRetryMiddleware、ModelFallbackMiddleware、ToolCallLimitMiddleware、SummarizationMiddleware、HumanInTheLoopMiddleware、OpenAIModerationMiddleware。你也可以针对具体应用策略编写自定义 middleware。

middleware 是开源的,但只有当它运行在智能体 runtime 内部时,它的价值才真正发挥出来。一旦如此,这些相同的策略就会自然成为 runtime 支持的每一种交互模式的一部分,无论是 streaming、human-in-the-loop 的暂停与恢复、重试、后台运行,还是长生命周期线程。在实践中,这意味着你的 guardrails 和埋点不再只是“尽力而为”,而是会在你预期的确切位置,稳定地包裹住每一次模型调用和每一次工具调用,不管智能体当时正在做什么。

可观测性

在真正把智能体跑到生产环境里之前,你并不知道它会做什么。传统应用的行为通常还能从代码中推演出来,但智能体的执行路径依赖的是模型在运行时做出的选择,比如调用哪些工具、传入什么参数、如何解释结果、什么时候放弃并换个办法。当事情出错时,你不能只是重新读一遍函数代码。你必须看到实际发生了什么。

比如有一张支持工单写着,智能体一直在重复问同一个问题。如果没有 trace,你只能根据用户的描述猜。可一旦有了 trace,你就能看到完整的执行树,包括用户的消息、模型原本计划给出的响应、它调用的工具、工具返回的结果、它随后生成的下一条消息,以及它最后陷入的那个循环。你还可以按成本过滤,找出哪些运行烧掉了大量 token;按错误过滤,找出哪些运行失败了;按用户过滤,看看某个具体客户经历了什么。你甚至能在成千上万次运行中看出单条 trace 根本无法暴露的模式。

每一个 LangSmith Deployment 都会自动连接到一个 tracing 项目。你开箱就能拿到完整的执行树,包括模型调用、工具调用、子智能体运行、middleware hook,以及可以按用户、时间窗口、成本、延迟、错误状态、反馈或自定义标签查询的结构化元数据。

trace 不只是调试工具,它还是改进闭环的基础。


LangSmith 的 AI 助手 Polly 会分析 trace,并给出洞察,比如常见失败模式、缓慢的工具调用、反复出现的模式,这样你就不用手动去读成千上万条。Online Evals 会自动对生产 trace 跑 LLM-as-judge 或自定义评分器,因此回归问题能够在出现时就被发现。我们正是用这套闭环,只改动 harness,就把 Deep Agents 在 Terminal Bench 2.0 上的成绩提升了 13.7 分。关于为什么智能体改进循环要从 trace 开始,这整套论证本身也很值得完整读一遍。

时间旅行

可观测性会告诉你发生了什么。时间旅行则让你追问,如果当时有某件事不一样,会发生什么?

最典型的场景,是调试一条跑偏了的运行。你的智能体在一次 20 步的运行中,于第 5 步做出了错误决策。它可能调用了错误的工具,误读了工具结果,或者在本该继续执行时反而提出了澄清问题。你想知道为什么,也想尝试其他分支,但又不想把整段流程从头重跑。更一般地说,只要智能体的路径依赖于某个 checkpoint 上的状态,你就会希望能够回退到那个 checkpoint,修改状态,然后观察剩余执行如何沿着不同路径展开。

由于每个 super-step 都会写入一个 checkpoint,因此运行历史上的每一个点,本来就是一个可以返回的快照。时间旅行只是把这件事显式化。你从某个线程的历史中选一个 checkpoint,可选地修改它的状态,然后从那里恢复继续。修改后的 checkpoint 会分叉出该线程的历史。原始历史保持不变,而新的路径则作为自己的分支向前运行。LLM 调用、工具调用以及 interrupt 都会在重放中重新触发,因此这些分叉跑的是真实的智能体循环,而不是某个替身版本。

这会解锁一些否则很难搭建的模式,比如调试为什么智能体选了工具 A 而不是本该选择的工具 B,在完全相同的上游上下文下比较两份提示词,从一次已经跑偏的运行中回退到最后一个健康状态重新继续,或者在多个分叉上探索反事实,以理解模型行为。LangSmith Studio UI 为这一切提供了可视化界面,而在大多数生产调试工作流中,大家最终更常用的是 API。
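“从历史 checkpoint 分叉、原始历史保持不变”的核心操作,可以用下面的独立草图示意(数据结构为演示假设):

```python
import copy

# 概念示意:时间旅行 = 截取到某个 checkpoint,修改状态后作为新分支起点。
def fork_from(history: list[dict], step: int, patch: dict) -> list[dict]:
    branch = copy.deepcopy(history[: step + 1])  # 截取到目标 checkpoint
    branch[-1]["state"].update(patch)            # 可选地修改该点状态
    return branch                                # 原 history 不受影响
```

之后在 branch 上继续执行真实的智能体循环即可,LLM 调用与工具调用都会重新触发,而原线程历史仍可随时回看。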

代码执行

一个只能调用你预先接好线的工具的智能体,能力天然受限于你的预判。一个能够运行任意代码的智能体,则是通用型的。它可以安装依赖、克隆仓库、执行测试、做数据分析、生成文档、渲染图表。这正是“具备函数调用的聊天机器人”和“真正能做事的智能体”之间的差距。

任意代码执行要求隔离。如果智能体在你的宿主机上运行 rm -rf /,那你的日子就不好过了。如果它读到了你的环境变量,它就能把 API key 外传。在智能体写出第一条命令之前,你就必须先在它的执行环境与你所重视的一切之间,建立一道边界。

在 Deep Agents 里,这种隔离是通过 sandbox backend 实现的。当你配置了一个实现 SandboxBackendProtocol 的 backend 时,智能体就会自动获得一个 execute 工具,用于在沙箱中运行 shell 命令,同时仍可使用标准文件系统工具。如果没有 sandbox backend,execute 工具甚至不会出现在智能体眼前。当前支持的 provider 包括 Daytona、Modal、Runloop 和 LangSmith Sandboxes,而且你只需要改一个配置,就可以在它们之间切换。

LangSmith Sandboxes 目前仍处于私有预览阶段,但值得单独点出来,因为它和运行时的其余部分是一起设计的。模板以声明方式定义容器镜像、资源限制和 volume。warm pool 会提前预配沙箱并自动补充,从而消除交互式智能体的冷启动延迟。auth proxy 则解决了每个团队迟早都会撞上的问题。智能体需要调用带认证的 API,但把凭证直接放进沙箱又有安全风险。这个代理会以 sidecar 形式运行,拦截出站请求,并自动从工作区 secrets 中注入凭证。于是沙箱内的代码调用 api.openai.com 时甚至不用自己带 header,代理会在请求发出时补上正确的 Authorization header。secret 永远不会进入沙箱,而智能体也无法外传自己看不到的东西。


有一条安全建议值得反复强调。沙箱保护的是你的宿主机,不是沙箱本身。 一个能够控制智能体输入的攻击者,比如通过被抓取网页中的 prompt injection、恶意邮件、被污染的工具结果,就可以指示智能体在沙箱里执行命令。沙箱会把攻击者拦在你的机器之外,但凡是 沙箱内部 的东西,包括你直接放进去的凭证,都已经视为失陷。auth proxy 这种模式,正是为这个原因存在的。

集成

当智能体能够接入人和组织已经在使用的系统时,它们才最有价值。一个编码智能体如果能连上 GitHub、Linear 和你的 CI 系统,它就会强大得多。一个研究智能体如果能把输出直接送进你的发布流水线,它就会更实用。一个内部智能体如果能被其他智能体当作构件来调用,它就会变成一个平台。如果这些集成每一个都要手工写适配器,你的智能体最终还是会彼此孤立。智能体与外部世界之间的边界,就会变成一堵墙。

开放协议通过一种方式解决了这个问题。它让智能体和外部系统能够彼此发现、相互通信,而无需任何一方了解对方的实现细节。Agent Server 会自动提供三种集成表面。

MCP

MCP,也就是 Model Context Protocol,是把智能体接到工具和数据源上的开放标准。每一个 LangSmith Deployment 都会自动暴露一个 MCP endpoint,因此你的智能体可以被任何兼容 MCP 的客户端发现,比如 Claude Desktop、IDE、其他智能体或自定义应用,而你完全不用自己写适配代码。反过来,你的智能体也可以调用任何 MCP server,比如 Linear、GitHub、Notion 以及数百种其他服务,从而接入用户已经拥有的工具和数据。

A2A

A2A,也就是 Agent-to-Agent,是智能体之间通信的对应标准,而每个 deployment 也都会自动暴露一个 A2A endpoint。这使得跨 deployment 的多智能体架构变得可行。一个 deployment 中的 orchestrator 智能体,可以用双方都理解的协议去发现并调用另一个 deployment 中的 worker 智能体,而无需手工设计 HTTP 合约。

Webhooks

Webhooks 处理的是向外通知的场景。你的智能体完成了一次 run,而你希望无需轮询就触发下游动作。你只要在创建 run 时传入一个 webhook URL,服务器就会在完成后把 run payload 以 POST 方式发到那个地址。这正是把智能体运行串进现有工作流的方法。比如一轮研究完成后触发发布流水线,每日总结完成后通知 Slack,或者合规检查完成后写入审计日志。对于生产环境,headers、域名 allowlist 和 HTTPS 强制等项都可以配置。

Cron

到目前为止,我们一直在讨论的智能体,基本都是响应式的。用户发来一条消息,智能体做出回应。但很多真正有价值的智能体工作其实是主动式的,也就是按计划执行,不需要任何人手动触发。

尤其常见的是两类模式。

  1. 睡眠时间计算。 智能体在空闲时段完成有用工作,于是用户获得的是累积思考后的结果,而不是按需等待的延迟。比如一个研究智能体每晚运行一次,跟进你所在领域的新论文。一个准备型智能体在你开始新一天之前,先审阅明天的日历并起草 briefing notes。一个分诊型智能体把隔夜的支持工单先分好类,让团队一上班就能面对排好优先级的队列。整个工作发生在没人等待的时候,而用户一出现,结果已经准备好了。

  2. 健康检查与监控循环。 智能体定期检查某件事,并在发现问题时采取行动或升级上报。比如一个值班智能体每十五分钟审阅一次告警,一个智能体监测你的 staging 环境是否出现回归,一个合规智能体按固定节奏扫描是否有违反策略的情况。这些运行和面向用户的运行一样,同样需要持久性、trace 和 auth,只不过没有人在前台等它。

Agent Server 内置了 cron jobs,因此这些定时运行拥有和其他任何 run 完全相同的持久性、trace 和 auth 保证。你不需要再维护一个独立调度器,也不需要再单独接一套可观测性。你只需传入标准的 cron 表达式(UTC)和输入内容,服务器就会按计划触发运行。

这里有两种形态,分别适配不同模式。

  1. 有状态 cron,也就是 client.crons.create_for_thread,会把调度绑定到一个特定 thread_id 上,因此每次触发的运行都会追加到同一段对话中。这适合那些应该看见自己历史的智能体,比如一个每天累积昨天研究成果的研究智能体,或者一个记得自己已经标记过什么问题的监控智能体。

  2. 无状态 cron,也就是 client.crons.create,会为每次执行新建一个线程。这适合那些不需要跨运行连续性的批处理型工作。线程清理由 on_run_completed 控制。"delete" 是默认值,会在运行完成后删除线程;"keep" 则会保留线程,之后可以通过 client.runs.search(metadata={"cron_id": cron_id}) 进行检索。
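“标准 cron 表达式(UTC)”的匹配语义,可以用下面的独立草图示意。它只支持 * 和具体数字,足以演示“每天凌晨 3 点”这类调度;注意这里直接用 Python 的 weekday()(周一为 0),与标准 cron 的星期编号(周日为 0)不同,仅作演示:

```python
from datetime import datetime, timezone

# 概念示意:判断某个 UTC 时刻是否命中五段 cron 表达式。
def cron_matches(expr: str, ts: datetime) -> bool:
    minute, hour, dom, month, dow = expr.split()
    fields = [(minute, ts.minute), (hour, ts.hour), (dom, ts.day),
              (month, ts.month), (dow, ts.weekday())]
    return all(f == "*" or int(f) == v for f, v in fields)
```

例如 "0 3 * * *" 只在每天 03:00(UTC)命中,调度器按分钟轮询这个判断即可触发运行。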


每一次 cron run 都会出现在 tracing 里,也会遵守 auth handler 和 middleware,并支持在失败后恢复。比如某个 cron 在凌晨 3 点撞上模型的瞬时故障,它不会悄无声息地失败,而是会像其他 run 一样被自动重试。还有一个运维提醒,事情做完后记得删除 cron。否则它会一直运行,也会一直计费。

我们看到企业团队的部署要求并不相同,因此这个 runtime 同时支持云端、混合和自托管部署。无论你运行在哪里,这些能力本身都是一致的。

deepagents deploy

deepagents deploy 是把你的智能体部署到上述 runtime 上的打包步骤。你在 deepagents.toml 中定义智能体,CLI 会把你的配置打包,并将其部署为一个 LangSmith Deployment,连同前面提到的所有能力一起交付。


Memory 使用一个带可插拔后端的虚拟文件系统,为智能体同时提供临时 scratch space 与跨对话持久存储。Deep Agents 支持按用户或按 assistant 进行作用域划分的 memory,也支持两者同时存在。

Sandbox providers,包括 LangSmith Sandboxes、Daytona、Modal、Runloop 或自定义方案,只是一个配置值而已。一旦存在 sandbox,harness 就会自动加入一个 execute 工具。sandbox 的生命周期,无论是 thread 级还是 assistant 级,都通过 graph factories 处理。sandbox 内部的凭证则通过 sandbox auth proxy 管理,因此 API key 永远不会出现在沙箱代码或日志中。

Skills and instructions 会从你的 skills/ 目录和 AGENTS.md 中自动识别。MCP servers 则会从 mcp.json 中读取。唯一必填的配置项只有 name,其他部分都有合理默认值。

最终得到的是一个能够随时间持续演化的部署。你可以逐步加入新的技能、工具和 memory 策略,而不需要整套重写。关于完整的生产环境考量,比如凭证管理、异步模式、前端集成等,请参阅 going-to-production guide。

Open Harness

在智能体基础设施领域,正在出现一个越来越明显的趋势,那就是一旦转向托管方案,构建者的选择权反而会减少。你可能会被锁定在单一模型提供商、封闭的 harness,或者某些被藏在 API 背后的 harness 功能里,比如只能在某个生态中使用的、服务器端 compaction 生成的加密摘要。实际后果就是,团队会逐渐失去对智能体究竟如何工作的可见性,也失去在它工作不对时修改它的能力。

关于厂商锁定,还值得专门说一句。deepagents deploy 的设计目标,就是避免它。这个 harness 采用 MIT 许可并完全开源,智能体指令使用 AGENTS.md 这种开放标准,智能体通过开放协议暴露,包括 MCP、A2A、Agent Protocol。这里没有模型锁定,也没有沙箱锁定,harness 的任何部分都不是黑盒。默认 harness 具备如下能力。


此外,Deep Agents 还允许你检查、定制并扩展智能体行为的每一层,包括限流、重试逻辑、模型 fallback、PII 检测,以及通过 LangChain middleware 控制文件权限。

把你的智能体带到生产环境

本指南列出的这些能力,包括持久执行、记忆、多租户、防护栏、human-in-the-loop、可观测性、沙箱化代码执行、定时运行等等,都是生产级智能体无法缺少的基础设施。deepagents deploy 把这一整套能力打包好,让团队不必从零拼装,同时又让整套栈始终保持开放、可配置,并真正属于你自己。

构建智能体,本质上是一个高度迭代的循环。trace 会暴露生产环境里真实发生的事,online evals 会在回归问题扩散之前把它们拦住,而 memory 则让智能体随着时间推移越来越有用。这套基础设施不只是支撑一个正在运行的智能体,它本身也是让智能体持续变好的根基。

如果你想试一试,quickstart 可以在几分钟内带你从 deepagents.toml 走到一个正在运行的 deployment。若想看完整的生产实践手册,包括 memory 作用域、sandbox 生命周期、凭证管理、防护栏以及前端集成,请参阅 going-to-production guide。若想更深入了解 runtime 本身,请查看 LangSmith Deployment 和 Agent Server 文档。

To build a good agent, you need a good harness. To deploy that agent, you need a good runtime.

要构建一个好的智能体,你需要一个好的 harness。要部署这个智能体,你还需要一个好的 runtime。

The harness is the system you build around the model to help your agent be successful in its domain. That includes prompts, tools, skills, and anything else supporting the model and tool calling loop that defines an agent. The runtime is everything underneath: durable execution, memory, multi-tenancy, observability, the machinery that keeps an agent running in production without your team reinventing it.

harness 是你围绕模型搭建的整套系统,用来帮助智能体在其领域中取得成功。其中包括提示词、工具、技能,以及任何支撑模型与工具调用循环的其他要素,而这个循环正是智能体的定义。runtime 则是更底层的一切,包括持久执行、记忆、多租户、可观测性,以及那套让智能体能在生产环境持续运行、而无需你的团队从头重复造轮子的机制。

This guide walks through the production requirements that surface once you deploy agents, the runtime capabilities that meet them, and how deepagents deploy packages those capabilities into something you can ship.

本指南会带你了解,智能体一旦部署后会浮现出哪些生产环境需求,哪些运行时能力可以满足这些需求,以及 deepagents deploy 如何把这些能力打包成真正可以交付上线的东西。

Runtime capabilities for production agents

生产级智能体所需的运行时能力

Throughout this section, "the runtime" refers to LangSmith Deployment (LSD) and its Agent Server: LSD runs agents in production, and Agent Server is the interface for assistants, threads, runs, memory, and scheduled jobs. The table below maps each production requirement to the runtime primitive that meets it.

在本节中,runtime 指的是 LangSmith Deployment(LSD)及其 Agent Server。LSD 负责让智能体在生产环境中运行,而 Agent Server 则是 assistants、threads、runs、memory 和 scheduled jobs 的接口。下表将每一项生产需求与能够满足它的运行时原语对应起来。

Durable execution

持久执行

Agents work by running a loop: Given a prompt, the model reasons, calls tools, observes the results, and repeats until it decides the task is complete.

智能体的工作方式,是运行一个循环。给定一个提示,模型会进行推理、调用工具、观察结果,然后不断重复,直到它判断任务已经完成。

Unlike a typical web request that returns in milliseconds, this loop can span minutes or hours. A single run might make dozens of model calls, spawn subagents, or wait indefinitely for a human to approve a draft. A crash, deploy, or transient failure anywhere in that loop shouldn't erase the work leading up to it.

不同于通常会在毫秒内返回的 Web 请求,这个循环可能持续数分钟甚至数小时。一次运行可能发起几十次模型调用,生成子智能体,或者无限期地等待人工批准一份草稿。在这个循环中的任何位置发生崩溃、部署变更或瞬时故障,都不该抹掉之前已经完成的工作。

In practice, you feel it in two places:

在实际中,这种需求会集中体现在两个地方。

Long runs need to survive infrastructure failures. A research agent spending twenty minutes gathering sources and synthesizing findings can't afford to restart from scratch if the worker process dies: the agent already paid for the tokens and executed the tool calls. What you want is resumption from the last completed step, with all prior state intact.

长时间运行必须能扛住基础设施故障。 一个研究型智能体如果花了二十分钟收集资料、综合结论,那么一旦 worker 进程挂掉,它就不能从头再来,因为 token 已经花掉了,工具调用也已经执行过了。理想情况是,它能从最后一个完成的步骤继续恢复,并保留此前的全部状态。

Agents need to be able to stop and wait. An agent that pauses for a human to approve a transaction doesn't know if the human will respond in thirty seconds or three days. Tying up a worker process or a client connection for that entire window isn't viable. The agent needs to truly stop: free resources, release workers, then pick up later exactly where it left off.

智能体必须能够真正停下来等待。 一个在等待人工批准交易的智能体,并不知道人会在三十秒后回复,还是三天后回复。在这整个等待窗口里一直占着 worker 进程或客户端连接,是不可行的。智能体需要真正停住,释放资源、交还 worker,然后在之后精确地从原地继续。

Both requirements are solved by the same thing: durable execution.

这两个需求,本质上都由同一件事解决,那就是持久执行。

  • Agents run on a managed task queue with automatic checkpointing, so any run can be retried, replayed, or resumed from the exact point of interruption.
  • 智能体运行在受管任务队列上,并带有自动 checkpoint,因此任何一次运行都可以从中断点精确地重试、重放或恢复。
  • Each super-step of graph execution writes a checkpoint to the persistence layer (PostgreSQL by default), keyed by a thread_id that acts as a persistent cursor into the run.
  • 图执行的每个 super-step 都会向持久层写入一个 checkpoint,默认写入 PostgreSQL,并以 thread_id 作为键。这个 thread_id 就像指向该次运行的持久游标。

  • When a worker crashes, the run's lease is released and another worker picks it up from the latest checkpoint.
  • 当某个 worker 崩溃时,这次运行的 lease 会被释放,并由另一个 worker 从最新 checkpoint 接手。
  • When an agent waits for human input, the process hands off its slot and the run sleeps indefinitely until resumed.
  • 当智能体等待人工输入时,进程会交出自己的执行槽位,而这次运行则会无限期休眠,直到被恢复。
  • Configurable retry policies control backoff, max attempts, and which exceptions trigger retries on a per-node basis.
  • 可配置的重试策略可以按节点控制退避方式、最大尝试次数,以及哪些异常会触发重试。

Durability is the foundation the rest of this list depends on. Because execution can pause and resume across process boundaries, agents can wait indefinitely for human input, run in the background, survive deploys mid-run, and handle concurrent inputs without corrupting state.

持久性是本列表其余能力赖以成立的基础。正因为执行可以跨进程边界暂停和恢复,智能体才能无限期等待人工输入、在后台运行、在运行中途经历部署变更而不丢状态,并在并发输入下不破坏状态。

Memory

记忆

Agents need two different kinds of memory, and the distinction matters.

智能体需要两种不同的记忆,而且这一区分非常重要。

Short-term memory is what the agent accumulates within a single conversation. The messages exchanged, the tool calls made, the intermediate state built up across a run. This lives in the checkpoint for the thread, scoped to a thread_id, and disappears (conceptually) when the conversation ends. A follow-up message on the same thread sees everything that came before on that thread.

短期记忆 是智能体在单次对话 内部 累积起来的内容。包括来回交换的消息、已经发起的工具调用、以及一次运行过程中逐步构建的中间状态。这些内容存在该线程的 checkpoint 里,作用域限定在某个 thread_id 上,并且在概念上会随着这段对话结束而结束。同一线程上的后续消息,能够看到此前在这个线程里发生过的一切。

Long-term memory is what the agent carries across conversations. This can include user preferences learned across conversations, project conventions and best practices, or a knowledge base enhanced with each new query. None of this belongs to any single thread. It's user-level or organization-level context that should persist across every conversation the agent has. Checkpoints alone can't do this, because checkpoint state is scoped to a single thread.

长期记忆 是智能体在 跨对话 场景中持续携带的内容。它可能包括跨多次对话学到的用户偏好、项目约定与最佳实践,或者随着每次新查询不断增强的知识库。这些内容都不属于某一个单独的线程,而是用户级或组织级上下文,应该在智能体参与的每一段对话之间持续存在。仅靠 checkpoint 做不到这一点,因为 checkpoint 的状态只绑定单个线程。

Long-term memory is what the Agent Server's built-in store is for. It's a key-value interface where memories are organized by namespace tuples (for example, (user_id, "memories")) and persisted across threads. Your agent writes to the store in one conversation and reads from it in the next. Backed by PostgreSQL by default, it supports semantic search via embedding configuration so agents can retrieve memories by meaning rather than exact match, and you can swap in a custom backend if you need different storage characteristics. The namespace structure is flexible: scope by user, assistant, organization, or any combination that fits your data model.

长期记忆正是 Agent Server 内置 store 的用途所在。它是一个键值接口,记忆按照 namespace tuple 组织,例如 (user_id, "memories"),并能够跨线程持久保存。你的智能体可以在一次对话中写入 store,在下一次对话中再读出来。它默认由 PostgreSQL 支撑,并通过 embedding 配置支持语义搜索,因此智能体可以按语义而非精确匹配来检索记忆。如果你需要不同的存储特性,也可以替换为自定义后端。namespace 结构本身也很灵活,可以按用户、assistant、组织,或任何适合你数据模型的组合进行划分。

Because memory that accumulates over months is some of the most valuable data the system produces, it matters where it lives. The store is queryable directly via API, and if you self-host, it lives in your own PostgreSQL instance. Keeping this data in a standard format you control is what lets you migrate between models, analyze it, or build on top of it outside the agent itself.

因为那些经过数月积累下来的记忆,往往是系统产出的最有价值的数据之一,所以它存放在哪里,非常重要。这个 store 可以直接通过 API 查询;如果你是自托管,它就存在你自己的 PostgreSQL 实例里。把这些数据保存在你能控制的标准格式中,才能让你在不同模型之间迁移、对其进行分析,或在智能体之外继续利用它。

Multi-tenancy

多租户

The moment your agent serves more than one user, a set of problems appears that didn't exist in single-player mode. These break down into three distinct concerns, and the Agent Server handles each with its own primitive.

智能体一旦开始为不止一个用户提供服务,就会立刻冒出一系列在单人模式下根本不存在的问题。这些问题可以拆成三个不同层面,而 Agent Server 为每一层都提供了独立的原语。

Isolating one user's data from another. User A's run should only touch User A's threads, and only read User A's memories. Custom authentication runs as middleware on every request: your @auth.authenticate handler validates the incoming credential and returns the user's identity and permissions, which get attached to the run context. Authorization handlers registered with @auth.on.threads, @auth.on.assistants.create, and so on then enforce who can see or modify what by tagging resources with ownership metadata on creation and returning filter dictionaries on reads. Handlers are matched from most specific to least, so you can start with a single global handler and add resource-specific ones as your model grows.

将一个用户的数据与另一个用户隔离开。 用户 A 的运行,只能访问用户 A 的线程,也只能读取用户 A 的记忆。自定义认证会作为 middleware 运行在每一个请求上。你的 @auth.authenticate handler 会校验传入凭证,并返回用户的身份与权限,这些信息随后会附加到运行上下文里。注册在 @auth.on.threads、@auth.on.assistants.create 等位置的授权处理器,则会在资源创建时打上归属元数据,并在读取时返回过滤字典,从而决定谁可以查看或修改哪些内容。handler 的匹配顺序从最具体到最通用,因此你可以先从一个全局 handler 起步,再随着模型演进逐步加入更细分的资源级 handler。

Letting the agent act on behalf of a user. Agents often need to call third-party services using the user's credentials—reading their calendar, posting to their Slack, opening a PR in their repo. Agent Auth handles the OAuth dance and token storage for this pattern, so the agent gets user-scoped credentials at runtime without you managing the refresh flow yourself. The user authenticates once; the agent can act on their behalf across subsequent runs.

让智能体代表用户行动。 智能体经常需要使用用户自己的凭证去调用第三方服务,比如读取 他们的 日历、向 他们的 Slack 发消息,或者在 他们的 仓库里开一个 PR。Agent Auth 负责处理这一模式中的 OAuth 流程和 token 存储,因此在运行时,智能体可以拿到用户作用域下的凭证,而无需你自己管理 refresh 流程。用户只需认证一次,智能体就可以在之后的多次运行中持续代表他们行动。

Controlling who can operate the system itself. Separate from end-user access, there's the question of which members of your team can deploy agents, configure them, view traces, or change auth policies. RBAC handles this operator-level access control.

控制谁能够操作系统本身。 除了终端用户访问之外,还存在另一个问题,那就是 你自己团队 里的哪些成员可以部署智能体、配置它们、查看 trace,或修改认证策略。RBAC 处理的就是这一层面的操作者访问控制。

The three layers compose: end users authenticate via your auth handler, the agent calls third-party services via Agent Auth, and your team operates the deployment under RBAC policies.

这三层可以自然组合起来。终端用户通过你的 auth handler 完成认证,智能体通过 Agent Auth 调用第三方服务,而你的团队则在 RBAC 策略之下操作整个部署系统。

Human-in-the-loop (HITL)

Human-in-the-loop(HITL)

Agents work by running a loop: given a prompt, a model reasons and decides to call tools, observes the results, and repeats until it decides it has completed the task at hand. Most of the time you want that loop to run uninterrupted. That’s where the value comes from. But sometimes you need a human in the middle of the loop at key decision points.

智能体的工作方式,是运行一个循环。给定一个提示,模型会推理并决定调用工具,观察结果,然后重复,直到它判断自己已经完成当前任务。多数时候,你希望这个循环不中断地跑下去,这正是它产生价值的地方。但有时,你需要在这个循环的关键决策点,把人放进中间。

There are two common situations where this comes up:

常见有两种情况。

  1. Reviewing a proposed tool call. Before the agent executes a consequential action (sending an email, executing a financial transaction, deleting files), you want a human to see exactly what it's about to do and decide how to respond. Take the email case: the agent drafts a message and pauses before sending. You can approve it as-is, edit the subject or body before it goes out, or reject it with a reason and specific edit requests so the agent can revise and try again.
  1. 审查一项拟执行的工具调用。 在智能体执行某个后果重大的动作之前,比如发送邮件、执行金融交易、删除文件,你希望有人能准确看到它即将做什么,并决定如何回应。拿发送邮件来说,智能体先起草消息,然后在真正发出前暂停。你可以原样批准,也可以在发出前修改主题或正文,或者附上理由与具体修改要求后拒绝,让智能体重新修改再试一次。
  2. An agent asking a clarifying question. Sometimes an agent reaches a decision point it can't resolve on its own, not because it lacks a tool but because the right answer depends on human judgment or preference. Rather than guessing, the agent can surface the question directly: "I found three config files matching that pattern. Which one should I modify?" or "Should this deploy to staging or production?" Your answer becomes the return value of the interrupt, and the agent continues from exactly where it stopped.
  2. 智能体主动提出澄清问题。 有时智能体会走到一个它无法自行解决的决策点,这不是因为缺工具,而是因为正确答案依赖人的判断或偏好。与其猜测,它可以直接把问题抛出来,比如“我找到了三个符合该模式的配置文件,应该修改哪一个?”或者“这次部署应该发到 staging 还是 production?”你的回答会成为这次 interrupt 的返回值,而智能体会从它停下来的地方继续往下走。

The Agent Server handles this with two primitives: interrupt() pauses execution and surfaces a payload to the caller; Command(resume=...) continues it with the human's response. Together they let you build approval gates, draft review loops, input validation, and any workflow where a human needs to weigh in mid-execution.

Agent Server 用两个原语来处理这件事。interrupt() 用来暂停执行,并把一个 payload 暴露给调用方;Command(resume=...) 则带着人的回应继续执行。它们配合起来,可以构建审批闸门、草稿审阅循环、输入校验,以及任何需要人在执行中途参与判断的工作流。

Under the hood, interrupt() triggers the runtime's checkpointer to write the full graph state to durable storage, keyed by a thread_id that acts as a persistent cursor. The process then frees resources and waits indefinitely. Unlike static breakpoints that pause before or after specific nodes, interrupt() is dynamic: place it anywhere in your code, wrap it in conditionals, or embed it inside a tool function so approval logic travels with the tool. When Command(resume=...) arrives—minutes, hours, or days later—the resume value becomes the return value of the interrupt() call, and execution picks up exactly where it stopped. Because resume accepts any JSON-serializable value, the response isn't limited to approve/reject: a reviewer can return an edited draft, a human can supply missing context, a downstream system can inject computed results. When parallel branches each call interrupt(), all pending interrupts are surfaced together and can be resumed in a single invocation, or one at a time as responses come back.

在底层,interrupt() 会触发运行时的 checkpointer,把完整的图状态写入持久存储,并以 thread_id 作为持久游标键。随后进程释放资源并无限期等待。不同于那些会在特定节点前后暂停的静态断点,interrupt() 是动态的。你可以把它放在代码里的任何位置,可以放在条件分支中,也可以直接嵌进某个工具函数里,让审批逻辑跟着工具一起走。当 Command(resume=...) 在几分钟、几小时甚至几天后到达时,这个 resume 值就会成为 interrupt() 调用的返回值,而执行会从停下来的原地继续。由于 resume 可以接受任何可 JSON 序列化的值,因此响应不局限于批准或拒绝。审阅者可以返回一份修改过的草稿,人可以补充缺失的上下文,下游系统也可以注入计算结果。当并行分支中各自调用 interrupt() 时,所有待处理的 interrupt 会一起暴露出来,并且可以在一次调用中统一恢复,也可以随着回应陆续回来逐个恢复。

Real-time interaction

实时交互

Human-in-the-loop is an interaction mode where execution can pause for a person to review or provide input—sometimes immediately, sometimes much later. Separately, there are “live session” problems that show up when the agent is actively working while the user is present: making progress visible (streaming) and coordinating concurrent messages (double-texting).

Human-in-the-loop 是一种交互模式,它允许执行暂停下来,等待人来审阅或提供输入,有时是立刻,有时则会晚很多。除此之外,还有另一类实时问题,会在用户在线、而智能体正在主动工作时出现,也就是让进度可见的 streaming,以及协调并发消息的 double-texting。

Streaming

Streaming

An agent that takes thirty seconds to produce a response leaves the user staring at a spinner with no signal about whether it's making progress, stuck, or about to fail. They also can't start reading the answer until the whole thing is done. Streaming solves both: partial output flows to the client as the agent produces it, so the user sees the response materialize in real time.

如果一个智能体要花三十秒才能给出回答,用户这三十秒里只能盯着一个转圈,不知道它到底是在推进、卡住了,还是马上就要失败。而且在完整回答结束之前,用户也无法开始阅读。streaming 同时解决了这两个问题。智能体在生成内容时,部分输出会持续流向客户端,因此用户能实时看到回答逐步成形。

The Streaming API supports several modes depending on what granularity you want: full state snapshots after each graph step, state updates only, token-by-token LLM output, or custom application events. You can also combine them. Run streaming (client.runs.stream()) is scoped to a single run; thread streaming (client.threads.joinStream()) opens a long-lived connection that delivers events from every run on a thread, useful when follow-up messages, background runs, or HITL resumptions all trigger activity on the same thread.

Streaming API 支持多种模式,取决于你希望看到多细的粒度。你可以在每个图步骤后获取完整状态快照,也可以只获取状态更新,只获取逐 token 的 LLM 输出,或者获取自定义应用事件。它们也可以组合使用。运行级 streaming,也就是 client.runs.stream(),作用域是单次 run;线程级 streaming,也就是 client.threads.joinStream(),则会打开一个长连接,持续接收某个线程上所有 run 产生的事件。当后续消息、后台运行或 HITL 恢复都在同一个线程上触发活动时,这种模式就很有用。

Thread streaming supports resumption via the Last-Event-ID header: the client reconnects with the ID of the last event it received, and the server replays from there with no gaps. Without this, every dropped connection means the client either misses output or has to start over.

线程级 streaming 支持通过 Last-Event-ID header 恢复。客户端会携带它收到的最后一个事件 ID 重新连接,服务器则会从那个位置继续回放,不会留下空档。如果没有这个机制,每次连接中断都会导致客户端要么漏掉输出,要么只能从头开始。

Double-texting

Double-texting

The second real-time problem: a user sends a new message while the agent is still working on the previous one. This happens constantly in chat UIs. Someone types a question, realizes they meant something slightly different, and fires off a correction before the first run finishes. We call this double-texting, and the runtime has to take a position on how to handle it.

第二类实时问题是,智能体还在处理上一条消息时,用户又发来了一条新消息。这在聊天 UI 中几乎时时都在发生。有人先发出一个问题,随后意识到自己想说的略有不同,于是在第一轮运行结束前又补发了一条修正。我们把这叫作 double-texting,而运行时必须明确决定该怎么处理。

There are four strategies, and the right one depends on your application:

这里有四种策略,具体哪种合适,取决于你的应用。

  • enqueue (the default): The new input waits for the current run to finish, then processes sequentially.
  • enqueue(默认):新输入等待当前运行结束后,再按顺序处理。
  • reject: Refuse any new input until the current run finishes.
  • reject:在当前运行完成之前,拒绝所有新输入。

  • interrupt: Halt the current run, preserve progress, and process the new input from that state. Useful when the second message builds on the first.
  • interrupt:中止当前运行,保留已有进展,并基于该状态处理新输入。这适合第二条消息建立在第一条消息之上的情况。
  • rollback: Halt the current run, revert all progress including the original input, and process the new message as a fresh run. Useful when the second message replaces the first.
  • rollback:中止当前运行,撤销所有进展,包括原始输入,然后把新消息当作一次全新运行来处理。这适合第二条消息直接替代第一条消息的情况。

interrupt gives the snappiest chat feel but requires your graph to handle partial tool calls cleanly (a tool call initiated but not completed when the interrupt hits may need cleanup on resume). enqueue is the safest default—no state corruption, at the cost of making the user wait.

interrupt 能带来最灵敏的聊天体验,但它要求你的图能干净地处理部分完成的工具调用,也就是说,某个工具调用在中断发生时已经启动但尚未完成,恢复时可能需要清理。enqueue 则是最稳妥的默认值,不会破坏状态,代价是用户必须等待。
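
The four strategies can be modeled as a small dispatcher. The strategy names come from the text; the dispatch logic below is illustrative, not the runtime's implementation:

这四种策略可以用一个小的分发函数来示意。策略名来自正文,分发逻辑仅作说明,并非 runtime 的真实实现:

```python
# Toy model of the four double-texting strategies.
def handle_double_text(strategy, current_run, new_input):
    if current_run["status"] != "running":
        return {"action": "start", "input": new_input}
    if strategy == "enqueue":
        current_run["queue"].append(new_input)      # process after current run
        return {"action": "queued", "input": new_input}
    if strategy == "reject":
        return {"action": "rejected", "reason": "run in progress"}
    if strategy == "interrupt":
        # Keep progress so far; the new input continues from current state.
        return {"action": "interrupted", "state": current_run["state"], "input": new_input}
    if strategy == "rollback":
        # Discard all progress, including the original input: fresh run.
        return {"action": "rolled_back", "state": None, "input": new_input}
    raise ValueError(f"unknown strategy: {strategy}")

run = {"status": "running", "queue": [], "state": {"messages": ["first question"]}}
queued = handle_double_text("enqueue", run, "correction")
fresh = handle_double_text("rollback", run, "correction")
```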

Guardrails

防护栏

Not every production concern can be expressed as "run the loop durably." Some have to shape the loop itself: intercepting model inputs, filtering tool outputs, enforcing limits on expensive operations. These policies belong in code, not in a prompt. They need to run every time, not whenever the model happens to remember them.

并不是所有生产环境问题,都能简单归结为"让循环持久运行"。有些问题必须直接塑造循环本身,比如拦截模型输入、过滤工具输出、为高成本操作设置上限。这些策略应该写在代码里,而不是塞进提示词里。它们必须每一次都执行,而不是靠模型偶尔记得去遵守。

Two cases make this concrete:

两个例子能把这件事讲得很具体。

  1. Redacting sensitive data before the model sees it. A customer support agent processes user messages containing PII (names, emails, account numbers). You don't want the model to see them, you don't want them in traces, and compliance likely requires redaction before logging. This has to happen before every model call, deterministically.
  1. 在模型看到敏感数据之前就先做脱敏。 一个客服智能体会处理包含个人敏感信息的用户消息,比如姓名、邮箱、账号号码。你不希望模型看到这些内容,也不希望它们出现在 trace 里,而且合规要求通常也会要求在日志记录前先完成脱敏。这必须在每一次模型调用前,以确定性的方式发生。
  2. Capping expensive operations. An agent that can call a paid external API needs a hard ceiling on how many calls it makes per run, because a confused model will otherwise happily call it fifty times and burn through your budget before lunch.
  2. 为高成本操作设置硬上限。 一个能够调用付费外部 API 的智能体,必须对每次运行中的调用次数设定一个硬性上限,否则模型一旦困惑起来,很可能会愉快地调用五十次,在中午之前就把你的预算烧光。

Both are handled by middleware, which wraps the agent loop at defined hooks—before_model, wrap_model_call, wrap_tool_call, after_model—so policies execute deterministically around every relevant step.

这两类问题都由 middleware 处理。middleware 会在既定 hook 上包裹整个智能体循环,比如 before_model、wrap_model_call、wrap_tool_call、after_model,从而保证这些策略会在每一个相关步骤周围被确定性地执行。

LangChain ships built-in middleware covering the common cases: PIIRedactionMiddleware, ModelRetryMiddleware, ModelFallbackMiddleware, ToolCallLimitMiddleware, SummarizationMiddleware, HumanInTheLoopMiddleware, OpenAIModerationMiddleware, and you can write custom middleware for application-specific policies.

LangChain 已经内置了覆盖常见需求的 middleware,比如 PIIRedactionMiddleware、ModelRetryMiddleware、ModelFallbackMiddleware、ToolCallLimitMiddleware、SummarizationMiddleware、HumanInTheLoopMiddleware、OpenAIModerationMiddleware。你也可以针对具体应用策略编写自定义 middleware。
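
The idea behind a custom tool-call cap can be sketched in plain Python. The hook name mirrors the text, but this is a standalone illustration, not the LangChain middleware API itself:

自定义工具调用上限的思路可以用纯 Python 来示意。hook 名称沿用正文,但这只是独立的示意,并非 LangChain middleware 的真实 API:

```python
# Illustrative sketch: a deterministic hook wrapping every tool call.
class ToolCallLimitSketch:
    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.calls = 0

    def wrap_tool_call(self, call_tool, name, args):
        if self.calls >= self.max_calls:
            # Fires every time in code, unlike a prompt-level reminder.
            return {"error": f"tool call budget of {self.max_calls} exhausted"}
        self.calls += 1
        return call_tool(name, args)

def paid_api(name, args):
    return {"ok": True, "name": name}

mw = ToolCallLimitSketch(max_calls=2)
results = [mw.wrap_tool_call(paid_api, "search", {}) for _ in range(3)]
# The third call is blocked deterministically, regardless of the model's plan.
```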

Middleware is open source, but it only really pays off when it runs inside the agent runtime. When it does, those same policies become part of every interaction mode the runtime supports—streaming, human-in-the-loop pauses/resumes, retries, background runs, and long-lived threads. In practice, that means your guardrails and instrumentation aren’t “best effort”: they consistently wrap every model call and every tool call, at the exact points you expect, no matter what the agent is doing.

middleware 是开源的,但只有当它运行在智能体 runtime 内部时,它的价值才真正发挥出来。一旦如此,这些相同的策略就会自然成为 runtime 支持的每一种交互模式的一部分,无论是 streaming、human-in-the-loop 的暂停与恢复、重试、后台运行,还是长生命周期线程。在实践中,这意味着你的 guardrails 和埋点不再只是"尽力而为",而是会在你预期的确切位置,稳定地包裹住每一次模型调用和每一次工具调用,不管智能体当时正在做什么。

Observability

可观测性

You don't know what an agent will do in production until you run it. Unlike a traditional application where you can reason about behavior from the code, an agent's execution path depends on the model's choices at runtime: which tools to call, what to pass them, how to interpret the results, and when to give up and try something else. When something goes wrong, you can't just re-read the function. You need to see what actually happened.

在真正把智能体跑到生产环境里之前,你并不知道它会做什么。传统应用的行为通常还能从代码中推演出来,但智能体的执行路径依赖的是模型在运行时做出的选择,比如调用哪些工具、传入什么参数、如何解释结果、什么时候放弃并换个办法。当事情出错时,你不能只是重新读一遍函数代码。你必须看到实际发生了什么。

A support ticket says "the agent kept asking the same question over and over." Without traces, you're guessing from the user's description. With traces, you see the full execution tree: the user's message, the model's planned response, the tool it called, the result it got back, the next message it generated, the loop it fell into. You can filter by cost to find runs that burned through tokens, by error to find runs that failed, by user to see what a specific customer experienced. You can spot patterns across thousands of runs that no individual trace would reveal.

比如有一张支持工单写着,智能体一直在重复问同一个问题。如果没有 trace,你只能根据用户的描述猜。可一旦有了 trace,你就能看到完整的执行树,包括用户的消息、模型原本计划给出的响应、它调用的工具、工具返回的结果、它随后生成的下一条消息,以及它最后陷入的那个循环。你还可以按成本过滤,找出哪些运行烧掉了大量 token;按错误过滤,找出哪些运行失败了;按用户过滤,看看某个具体客户经历了什么。你甚至能在成千上万次运行中看出单条 trace 根本无法暴露的模式。

Every LangSmith Deployment is automatically wired to a tracing project. You get the full execution tree out of the box—model calls, tool calls, subagent runs, middleware hooks—with structured metadata you can query by user, time window, cost, latency, error state, feedback, or custom tags.

每一个 LangSmith Deployment 都会自动连接到一个 tracing 项目。你开箱就能拿到完整的执行树,包括模型调用、工具调用、子智能体运行、middleware hook,以及可以按用户、时间窗口、成本、延迟、错误状态、反馈或自定义标签查询的结构化元数据。

Traces aren't just a debugging tool; they're the foundation of the improvement loop:

trace 不只是调试工具,它还是改进闭环的基础。

Polly, the LangSmith AI assistant, analyzes traces and surfaces insights—common failure modes, slow tool calls, repeated patterns—so you're not reading thousands by hand. Online Evals run LLM-as-judge or custom scorers against production traces automatically, so regressions get caught as they happen. We used this loop to improve Deep Agents by 13.7 points on Terminal Bench 2.0 by only changing the harness—the whole argument for why the agent improvement loop starts with a trace is worth reading in full.

LangSmith 的 AI 助手 Polly 会分析 trace,并给出洞察,比如常见失败模式、缓慢的工具调用、反复出现的模式,这样你就不用手动去读成千上万条。Online Evals 会自动对生产 trace 跑 LLM-as-judge 或自定义评分器,因此回归问题能够在出现时就被发现。我们正是用这套闭环,只改动 harness,就把 Deep Agents 在 Terminal Bench 2.0 上的成绩提升了 13.7 分。关于为什么智能体改进循环要从 trace 开始,这整套论证本身也很值得完整读一遍。

Time travel

时间旅行

Observability tells you what happened. Time travel lets you ask what would have happened if something had gone differently.

可观测性会告诉你发生了什么。时间旅行则让你追问:如果当时有某件事不一样,会发生什么?

The motivating case is debugging a run that went off the rails. Your agent made a bad decision at step 5 of a 20-step run: it called the wrong tool, misread a tool result, or asked a clarifying question when it should have kept going. You want to understand why, and you want to try alternatives without re-running the whole thing from scratch. More generally, any time an agent's path depends on state at a particular checkpoint, you want the ability to rewind to that checkpoint, change the state, and let the rest of the run unfold differently.

最典型的场景,是调试一条跑偏了的运行。你的智能体在一次 20 步的运行中,于第 5 步做出了错误决策。它可能调用了错误的工具,误读了工具结果,或者在本该继续执行时反而提出了澄清问题。你想知道为什么,也想尝试其他分支,但又不想把整段流程从头重跑。更一般地说,只要智能体的路径依赖于某个 checkpoint 上的状态,你就会希望能够回退到那个 checkpoint,修改状态,然后观察剩余执行如何沿着不同路径展开。

Because every super-step writes a checkpoint, every point in a run's history is already a snapshot you can return to. Time travel makes this explicit: pick a checkpoint from a thread's history, optionally modify its state, and resume from there. The modified checkpoint forks the thread's history. The original stays intact, and the new path runs forward as its own branch. LLM calls, tool calls, and interrupts all re-trigger on replay, so forks exercise the real agent loop rather than a stub of it.

由于每个 super-step 都会写入一个 checkpoint,因此运行历史上的每一个点,本来就是一个可以返回的快照。时间旅行只是把这件事显式化。你从某个线程的历史中选一个 checkpoint,可选地修改它的状态,然后从那里恢复继续。修改后的 checkpoint 会分叉出该线程的历史。原始历史保持不变,而新的路径则作为自己的分支向前运行。LLM 调用、工具调用以及 interrupt 都会在重放中重新触发,因此这些分叉跑的是真实的智能体循环,而不是某个替身版本。
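
The fork semantics can be shown with a toy model that treats checkpoints as plain dicts — illustrative only; the real runtime persists them in Postgres:

fork 的语义可以用一个玩具模型来示意,这里把 checkpoint 当作普通字典(仅作说明;真实 runtime 会把它们持久化到 Postgres):

```python
# Minimal model of checkpoint-based time travel: history is a list of
# checkpoints; forking copies a prefix, edits state, and runs forward on
# a new branch while the original history stays intact.
def fork(history, checkpoint_index, state_edit):
    branch = [dict(cp) for cp in history[: checkpoint_index + 1]]
    branch[-1]["state"] = {**branch[-1]["state"], **state_edit}
    return branch  # `history` itself is never mutated

history = [
    {"step": 1, "state": {"tool": None}},
    {"step": 2, "state": {"tool": "search_web"}},   # the bad decision
    {"step": 3, "state": {"tool": "search_web", "result": "irrelevant"}},
]

# Rewind to step 2, swap the tool choice, and let the branch run forward:
branch = fork(history, checkpoint_index=1, state_edit={"tool": "query_database"})
```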

This unlocks patterns that are hard to build otherwise: debugging why the agent chose tool A when it should have chosen tool B, comparing two prompts against the same upstream context, recovering from a run that went sideways by rewinding to the last good state, or exploring counterfactuals across many forks to understand model behavior. The LangSmith Studio UI gives you a visual interface for all of this; the API is what most production debugging workflows end up using.

这会解锁一些否则很难搭建的模式,比如调试为什么智能体选了工具 A 而不是本该选择的工具 B,在完全相同的上游上下文下比较两份提示词,从一次已经跑偏的运行中回退到最后一个健康状态重新继续,或者在多个分叉上探索反事实,以理解模型行为。LangSmith Studio UI 为这一切提供了可视化界面,而在大多数生产调试工作流中,大家最终更常用的是 API。

Code execution

代码执行

An agent that can only call the tools you pre-wired is limited to what you anticipated. An agent that can run arbitrary code is general-purpose: it can install dependencies, clone repos, execute tests, run data analysis, generate documents, and render plots. This is the gap between "chatbot with function calling" and "agent that can actually do things."

一个只能调用你预先接好线的工具的智能体,能力天然受限于你的预判。一个能够运行任意代码的智能体,则是通用型的。它可以安装依赖、克隆仓库、执行测试、做数据分析、生成文档、渲染图表。这正是"具备函数调用的聊天机器人"和"真正能做事的智能体"之间的差距。

Arbitrary code execution requires isolation. If the agent runs rm -rf / on your host, you have a bad day. If it reads your environment variables, it exfiltrates your API keys. You need a boundary between the agent's execution environment and everything you care about, and you need it before the agent writes its first command.

任意代码执行要求隔离。如果智能体在你的宿主机上运行 rm -rf /,那你的日子就不好过了。如果它读到了你的环境变量,它就能把 API key 外传。在智能体写出第一条命令之前,你就必须先在它的执行环境与你所重视的一切之间,建立一道边界。

In Deep Agents, isolation happens through sandbox backends. When you configure a backend that implements SandboxBackendProtocol, the agent automatically gets an execute tool for running shell commands in the sandbox alongside the standard filesystem tools. Without a sandbox backend, the execute tool isn't even visible to the agent. Supported providers include Daytona, Modal, Runloop, and LangSmith Sandboxes, and you can swap between them with a single configuration change.

在 Deep Agents 里,这种隔离是通过 sandbox backend 实现的。当你配置了一个实现 SandboxBackendProtocol 的 backend 时,智能体就会自动获得一个 execute 工具,用于在沙箱中运行 shell 命令,同时仍可使用标准文件系统工具。如果没有 sandbox backend,execute 工具甚至不会出现在智能体眼前。当前支持的 provider 包括 Daytona、Modal、Runloop 和 LangSmith Sandboxes,而且你只需要改一个配置,就可以在它们之间切换。

LangSmith Sandboxes (currently in private preview) are worth a specific callout because they're built to integrate with the rest of the runtime. Templates define container images, resource limits, and volumes declaratively. Warm pools pre-provision sandboxes with automatic replenishment, eliminating cold start latency for interactive agents. And the auth proxy solves a problem every team hits eventually: the agent needs to call authenticated APIs, but putting credentials inside the sandbox is a security risk. The proxy runs as a sidecar, intercepts outbound requests, and injects credentials from workspace secrets automatically—the sandbox code calls api.openai.com with no headers, and the proxy adds the right Authorization header on the way out. Secrets never enter the sandbox, and the agent can't exfiltrate what it can't see.

LangSmith Sandboxes 目前仍处于私有预览阶段,但值得单独点出来,因为它和运行时的其余部分是一起设计的。模板以声明方式定义容器镜像、资源限制和 volume。warm pool 会提前预配沙箱并自动补充,从而消除交互式智能体的冷启动延迟。auth proxy 则解决了每个团队迟早都会撞上的问题。智能体需要调用带认证的 API,但把凭证直接放进沙箱又有安全风险。这个代理会以 sidecar 形式运行,拦截出站请求,并自动从工作区 secrets 中注入凭证。于是沙箱内的代码调用 api.openai.com 时甚至不用自己带 header,代理会在请求发出时补上正确的 Authorization header。secret 永远不会进入沙箱,而智能体也无法外传自己看不到的东西。

One piece of security guidance worth repeating: sandboxes protect your host, not the sandbox itself. An attacker who controls the agent's input (via prompt injection in a scraped webpage, a malicious email, a poisoned tool result) can instruct the agent to run commands inside the sandbox. The sandbox keeps the attacker off your machine, but anything inside the sandbox—including credentials placed there directly—is compromised. The auth proxy pattern exists for exactly this reason.

有一条安全建议值得反复强调:沙箱保护的是你的宿主机,不是沙箱本身。一个能够控制智能体输入的攻击者,比如通过被抓取网页中的 prompt injection、恶意邮件、被污染的工具结果,就可以指示智能体在沙箱里执行命令。沙箱会把攻击者拦在你的机器之外,但凡是沙箱内部的东西,包括你直接放进去的凭证,都应视为已经失陷。auth proxy 这种模式,正是为这个原因而存在的。

Integrations

集成

Agents are most useful when they plug into the systems people and organizations already use. A coding agent becomes more powerful when it can reach into GitHub, Linear, and your CI system. A research agent becomes more useful when its output feeds into your publishing pipeline. An internal agent becomes a platform when other agents can call it as a building block. If every one of those integrations is a hand-rolled adapter, your agents stay isolated. The boundary between "agent" and "everything else" becomes a wall.

当智能体能够接入人和组织已经在使用的系统时,它们才最有价值。一个编码智能体如果能连上 GitHub、Linear 和你的 CI 系统,它就会强大得多。一个研究智能体如果能把输出直接送进你的发布流水线,它就会更实用。一个内部智能体如果能被其他智能体当作构件来调用,它就会变成一个平台。如果这些集成每一个都要手工写适配器,你的智能体最终还是会彼此孤立。智能体与外部世界之间的边界,就会变成一堵墙。

Open protocols solve this by letting agents and external systems discover and talk to each other without either side knowing the other's implementation. The Agent Server provisions three integration surfaces automatically.

开放协议解决了这个问题:它让智能体和外部系统能够彼此发现、相互通信,而无需任何一方了解对方的实现细节。Agent Server 会自动提供三种集成表面。

MCP

MCP

MCP (Model Context Protocol) is the open standard for connecting agents to tools and data sources. Every LangSmith Deployment automatically exposes an MCP endpoint, making your agent discoverable by any MCP-compliant client—Claude Desktop, IDEs, other agents, custom applications—without you writing adapter code. In the other direction, your agent can call out to any MCP server (Linear, GitHub, Notion, and hundreds of others) to reach tools and data your users already have.

MCP,也就是 Model Context Protocol,是把智能体接到工具和数据源上的开放标准。每一个 LangSmith Deployment 都会自动暴露一个 MCP endpoint,因此你的智能体可以被任何兼容 MCP 的客户端发现,比如 Claude Desktop、IDE、其他智能体或自定义应用,而你完全不用自己写适配代码。反过来,你的智能体也可以调用任何 MCP server,比如 Linear、GitHub、Notion 以及数百种其他服务,从而接入用户已经拥有的工具和数据。

A2A

A2A

A2A (Agent-to-Agent) is the analogous standard for agent-to-agent communication, and every deployment exposes an A2A endpoint automatically as well. This is what makes multi-agent architectures across deployments tractable: an orchestrator agent in one deployment can discover and call worker agents in another using a protocol both sides understand, with no hand-rolled HTTP contracts.

A2A,也就是 Agent-to-Agent,是智能体之间通信的对应标准,而每个 deployment 也都会自动暴露一个 A2A endpoint。这使得跨 deployment 的多智能体架构变得可行。一个 deployment 中的 orchestrator 智能体,可以用双方都理解的协议去发现并调用另一个 deployment 中的 worker 智能体,而无需手工设计 HTTP 合约。

Webhooks

Webhooks

Webhooks handle the outbound case: your agent finishes a run, and you want to kick off something downstream without polling. Pass a webhook URL when creating a run, and the server POSTs the run payload to that URL on completion. This is how you chain agent runs into existing workflows—a research run completes and triggers a publishing pipeline, a daily summary finishes and notifies Slack, a compliance check completes and writes to your audit log. Headers, domain allowlists, and HTTPS enforcement are all configurable for production environments.

Webhooks 处理的是向外通知的场景。你的智能体完成了一次 run,而你希望无需轮询就触发下游动作。你只要在创建 run 时传入一个 webhook URL,服务器就会在完成后把 run payload 以 POST 方式发到那个地址。这正是把智能体运行串进现有工作流的方法。比如一轮研究完成后触发发布流水线,每日总结完成后通知 Slack,或者合规检查完成后写入审计日志。对于生产环境,headers、域名 allowlist 和 HTTPS 强制等项都可以配置。
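
The contract can be sketched with the HTTP POST stubbed out — the webhook URL and function names below are hypothetical:

这个约定可以这样示意,其中 HTTP POST 被打了桩。下面的 webhook URL 和函数名都是虚构的:

```python
# Sketch of the webhook contract: when the run completes, the server POSTs
# the run payload to the registered URL. `post` is stubbed here; in
# production it is an HTTP POST with configurable headers and allowlists.
def run_agent(user_input):
    return {"status": "success", "output": f"handled: {user_input}"}

def execute_run(user_input, webhook=None, post=None):
    payload = run_agent(user_input)
    if webhook and post:
        post(webhook, payload)  # fire-and-forget notification, no polling
    return payload

delivered = []
execute_run(
    "summarize today's tickets",
    webhook="https://example.internal/hooks/run-done",  # hypothetical URL
    post=lambda url, body: delivered.append((url, body)),
)
```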

Cron

Cron

The agents we've been talking about so far are reactive: a user sends a message, the agent responds. But a lot of valuable agent work is proactive—it happens on a schedule, with no human triggering it.

到目前为止,我们一直在讨论的智能体,基本都是响应式的。用户发来一条消息,智能体做出回应。但很多真正有价值的智能体工作其实是主动式的,也就是按计划执行,不需要任何人手动触发。

Two patterns in particular:

尤其常见的是两类模式。

  1. Sleep-time compute. Agents that do useful work during idle periods, so users benefit from accumulated thinking rather than on-demand latency. A research agent that runs nightly to catch up on new papers in your field. A prep agent that reviews tomorrow's calendar and drafts briefing notes before you start your day. A triage agent that classifies overnight support tickets so your team walks into a prioritized queue. The work happens while nobody's waiting, and the output is ready when the user shows up.
  1. 睡眠时间计算。 智能体在空闲时段完成有用工作,于是用户获得的是累积思考后的结果,而不是按需等待的延迟。比如一个研究智能体每晚运行一次,跟进你所在领域的新论文。一个准备型智能体在你开始新一天之前,先审阅明天的日历并起草 briefing notes。一个分诊型智能体把隔夜的支持工单先分好类,让团队一上班就能面对排好优先级的队列。整个工作发生在没人等待的时候,而用户一出现,结果已经准备好了。
  2. Health and monitoring loops. Agents that periodically check on something and act (or escalate) if they find an issue. An on-call agent that reviews alerts every fifteen minutes, an agent that monitors your staging environment for regressions, a compliance agent that sweeps for policy violations on a cadence. These need the same durability, tracing, and auth as user-facing runs, but no user is waiting on them.
  2. 健康检查与监控循环。 智能体定期检查某件事,并在发现问题时采取行动或升级上报。比如一个值班智能体每十五分钟审阅一次告警,一个智能体监测你的 staging 环境是否出现回归,一个合规智能体按固定节奏扫描是否有违反策略的情况。这些运行和面向用户的运行一样,同样需要持久性、trace 和 auth,只不过没有人在前台等它。

The Agent Server has cron jobs built in, so scheduled runs get the same durability, tracing, and auth guarantees as any other run—no separate scheduler to maintain, no second observability story to wire up. You pass a standard cron expression (UTC) and an input, and the server triggers runs on schedule.

Agent Server 内置了 cron jobs,因此这些定时运行拥有和其他任何 run 完全相同的持久性、trace 和 auth 保证。你不需要再维护一个独立调度器,也不需要再单独接一套可观测性。你只需传入标准的 cron 表达式(UTC)和输入内容,服务器就会按计划触发运行。

Two flavors fit different patterns:

这里有两种形态,分别适配不同模式。

  1. Stateful cron (client.crons.create_for_thread) ties the schedule to a specific thread_id, so every triggered run appends to the same conversation. This fits agents that should see their own history—a daily research agent that builds on yesterday's findings, or a monitoring agent that remembers what it already flagged.
  1. 有状态 cron,也就是 client.crons.create_for_thread,会把调度绑定到一个特定 thread_id 上,因此每次触发的运行都会追加到同一段对话中。这适合那些应该看见自己历史的智能体,比如一个每天累积昨天研究成果的研究智能体,或者一个记得自己已经标记过什么问题的监控智能体。
  2. Stateless cron (client.crons.create) spins up a fresh thread for each execution, which fits batch-style work that doesn't need continuity between runs. Control thread cleanup via on_run_completed: "delete" (the default) removes the thread when the run finishes, "keep" preserves it for later retrieval via client.runs.search(metadata={"cron_id": cron_id}).
  2. 无状态 cron,也就是 client.crons.create,会为每次执行新建一个线程。这适合那些不需要跨运行连续性的批处理型工作。线程清理由 on_run_completed 控制。"delete" 是默认值,会在运行完成后删除线程;"keep" 则会保留线程,之后可以通过 client.runs.search(metadata={"cron_id": cron_id}) 进行检索。

Every cron run shows up in tracing, respects auth handlers and middleware, and supports resumption on failure—a cron that hits a transient model outage at 3am doesn't silently fail, it gets retried like any other run. One operational note: delete crons when you're done with them. They keep running (and billing) until you do.

每一次 cron run 都会出现在 tracing 里,也会遵守 auth handler 和 middleware,并支持在失败后恢复。比如某个 cron 在凌晨 3 点撞上模型的瞬时故障,它不会悄无声息地失败,而是会像其他 run 一样被自动重试。还有一个运维提醒,事情做完后记得删除 cron。否则它会一直运行,也会一直计费。
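
The difference between the two flavors comes down to thread handling, which a toy scheduler can illustrate — the real interface is the client.crons methods named above:

两种形态的差别归根结底在线程处理方式上,下面用一个玩具调度器来说明。真实接口是上文提到的 client.crons 方法:

```python
# Toy model of the two cron flavors: a stateful cron appends every run to
# one fixed thread; a stateless cron creates a fresh thread per execution.
import itertools

thread_ids = itertools.count(1)
threads = {}  # thread_id -> list of run inputs

def trigger(cron):
    if cron["thread_id"] is not None:   # stateful: same thread every time
        tid = cron["thread_id"]
    else:                               # stateless: fresh thread per run
        tid = next(thread_ids)
    threads.setdefault(tid, []).append(cron["input"])
    return tid

stateful = {"thread_id": 99, "input": "daily paper sweep"}
stateless = {"thread_id": None, "input": "nightly batch job"}

tid_a = trigger(stateful)
tid_b = trigger(stateful)    # same thread: the agent sees yesterday's run
tid_c = trigger(stateless)
tid_d = trigger(stateless)   # fresh thread each time, no shared history
```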

We see enterprise teams with varying deployment requirements, so the runtime supports cloud, hybrid, and self-hosted deployments. The capabilities are the same regardless of where you run it.

我们看到企业团队的部署要求并不相同,因此这个 runtime 同时支持云端、混合和自托管部署。无论你运行在哪里,这些能力本身都是一致的。

deepagents deploy

deepagents deploy

deepagents deploy is the packaging step that deploys your agent on the runtime described above. You define your agent in deepagents.toml, and the CLI bundles your configuration and deploys it as a LangSmith Deployment with all of the aforementioned features.

deepagents deploy 是把你的智能体部署到上述 runtime 上的打包步骤。你在 deepagents.toml 中定义智能体,CLI 会把你的配置打包,并将其部署为一个 LangSmith Deployment,连同前面提到的所有能力一起交付。

Memory uses a virtual filesystem with pluggable backends that gives agents both ephemeral scratch space and persistent cross-conversation storage. Deep Agents support memory scoped to users or assistants (or both)!

Memory 使用一个带可插拔后端的虚拟文件系统,为智能体同时提供临时 scratch space 与跨对话持久存储。Deep Agents 支持按用户或按 assistant 进行作用域划分的 memory,也支持两者同时存在。

Sandbox providers (LangSmith Sandboxes, Daytona, Modal, Runloop, or custom) are a single config value. When a sandbox is present, the harness automatically adds an execute tool. Sandbox lifecycle (thread-scoped vs assistant-scoped) is handled through graph factories. Credentials inside sandboxes are managed through the sandbox auth proxy so API keys never appear in sandbox code or logs.

Sandbox providers,包括 LangSmith Sandboxes、Daytona、Modal、Runloop 或自定义方案,只是一个配置值而已。一旦存在 sandbox,harness 就会自动加入一个 execute 工具。sandbox 的生命周期,无论是 thread 级还是 assistant 级,都通过 graph factories 处理。sandbox 内部的凭证则通过 sandbox auth proxy 管理,因此 API key 永远不会出现在沙箱代码或日志中。

Skills and instructions are auto-detected from your skills/ directory and AGENTS.md. MCP servers are picked up from mcp.json. The name field is the only required config value; everything else has sensible defaults.

Skills and instructions 会从你的 skills/ 目录和 AGENTS.md 中自动识别。MCP servers 则会从 mcp.json 中读取。唯一必填的配置项只有 name,其他部分都有合理默认值。
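
A minimal sketch of what that can look like — the value is hypothetical; per the text, only name is required, and skills/, AGENTS.md, and mcp.json are picked up automatically:

一个最小示意如下。其中的取值是虚构的;按正文所述,只有 name 是必填项,而 skills/、AGENTS.md 和 mcp.json 都会被自动识别:

```toml
# Hypothetical minimal deepagents.toml. `name` is the only required field;
# skills/ and AGENTS.md are auto-detected and MCP servers are read from
# mcp.json, so none of those need entries here. Consult the deepagents
# docs for the actual schema.
name = "research-agent"
```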

The result is a deployment that can evolve over time, with new skills, tools, and memory policies, without a full rewrite. For the complete set of production considerations (credential management, async patterns, frontend integration, and more), see the going-to-production guide.

最终得到的是一个能够随时间持续演化的部署。你可以逐步加入新的技能、工具和 memory 策略,而不需要整套重写。关于完整的生产环境考量,比如凭证管理、异步模式、前端集成等,请参阅 going-to-production guide。

Open Harness

Open Harness

There's a growing trend in agent infrastructure where moving to a managed solution comes with reduced builder choice—lock-in to a single model provider, a closed harness, or harness functionality hidden behind APIs (like server-side compaction that generates encrypted summaries you can't use outside one ecosystem). The practical consequence is that teams lose visibility into how their agent actually works, and lose the ability to change it when it doesn't.

在智能体基础设施领域,正在出现一个越来越明显的趋势,那就是一旦转向托管方案,构建者的选择权反而会减少。你可能会被锁定在单一模型提供商、封闭的 harness,或者某些被藏在 API 背后的 harness 功能里,比如只能在某个生态中使用的、服务器端 compaction 生成的加密摘要。实际后果就是,团队会逐渐失去对智能体究竟如何工作的可见性,也失去在它工作不对时修改它的能力。

One note on vendor lock-in: deepagents deploy is built to avoid it. The harness is MIT licensed and fully open source, agent instructions use AGENTS.md (an open standard), and agents are exposed via open protocols—MCP, A2A, Agent Protocol. There's no model or sandbox lock-in, and nothing about the harness is a black box. The default harness offers the following capabilities:

关于厂商锁定,还值得专门说一句。deepagents deploy 的设计目标,就是避免它。这个 harness 采用 MIT 许可并完全开源,智能体指令使用 AGENTS.md 这种开放标准,智能体通过开放协议暴露,包括 MCP、A2A、Agent Protocol。这里没有模型锁定,也没有沙箱锁定,harness 的任何部分都不是黑盒。默认 harness 具备如下能力。

Additionally, Deep Agents allows you to inspect, customize, and extend every layer of agent behavior, including rate limits, retry logic, model fallback, PII detection, and file permissions via LangChain's middleware.

此外,Deep Agents 还允许你检查、定制并扩展智能体行为的每一层,包括限流、重试逻辑、模型 fallback、PII 检测,以及通过 LangChain middleware 控制文件权限。

Take your agents to production

把你的智能体带到生产环境

The capabilities this guide outlines—durable execution, memory, multi-tenancy, guardrails, human-in-the-loop, observability, sandboxed code execution, scheduled runs, and more—are the infrastructure requirements production agents can't function without. deepagents deploy packages all of it so teams don't have to assemble it from scratch, and keeps the stack open, configurable, and yours throughout.

本指南列出的这些能力,包括持久执行、记忆、多租户、防护栏、human-in-the-loop、可观测性、沙箱化代码执行、定时运行等等,都是生产级智能体无法缺少的基础设施。deepagents deploy 把这一整套能力打包好,让团队不必从零拼装,同时又让整套栈始终保持开放、可配置,并真正属于你自己。

Building agents is a deeply iterative cycle: traces surface what's actually happening in production, online evals catch regressions before they compound, and memory means the agent gets more useful over time. The infrastructure isn't just supporting the live agent, it's the foundation for making it better.

构建智能体,本质上是一个高度迭代的循环。trace 会暴露生产环境里真实发生的事,online evals 会在回归问题扩散之前把它们拦住,而 memory 则让智能体随着时间推移越来越有用。这套基础设施不只是支撑一个正在运行的智能体,它本身也是让智能体持续变好的根基。

If you want to try it out, the quickstart will get you from deepagents.toml to a running deployment in minutes. For the full production playbook including memory scoping, sandbox lifecycle, credential management, guardrails, and frontend integration, see the going-to-production guide. For a deeper look at the runtime itself, see the LangSmith Deployment and Agent Server docs.

如果你想试一试,quickstart 可以在几分钟内带你从 deepagents.toml 走到一个正在运行的 deployment。若想看完整的生产实践手册,包括 memory 作用域、sandbox 生命周期、凭证管理、防护栏以及前端集成,请参阅 going-to-production guide。若想更深入了解 runtime 本身,请查看 LangSmith Deployment 和 Agent Server 文档。

To build a good agent, you need a good harness. To deploy that agent, you need a good runtime.

The harness is the system you build around the model to help your agent be successful in its domain. That includes prompts, tools, skills, and anything else supporting the model and tool calling loop that defines an agent. The runtime is everything underneath: durable execution, memory, multi-tenancy, observability, the machinery that keeps an agent running in production without your team reinventing it.

This guide walks through the production requirements that surface once you deploy agents, the runtime capabilities that meet them, and how deepagents deploy packages those capabilities into something you can ship.

Runtime capabilities for production agents

Throughout this section, "the runtime" refers to LangSmith Deployment (LSD) and its Agent Server: LSD runs agents in production, and Agent Server is the interface for assistants, threads, runs, memory, and scheduled jobs. The table below maps each production requirement to the runtime primitive that meets it.


Durable execution

Agents work by running a loop: Given a prompt, the model reasons, calls tools, observes the results, and repeats until it decides the task is complete.


Unlike a typical web request that returns in milliseconds, this loop can span minutes or hours. A single run might make dozens of model calls, spawn subagents, or wait indefinitely for a human to approve a draft. A crash, deploy, or transient failure anywhere in that loop shouldn't erase the work leading up to it.

In practice, you feel it in two places:

Long runs need to survive infrastructure failures. A research agent spending twenty minutes gathering sources and synthesizing findings can't afford to restart from scratch if the worker process dies: the agent already paid for the tokens and executed the tool calls. What you want is resumption from the last completed step, with all prior state intact.

Agents need to be able to stop and wait. An agent that pauses for a human to approve a transaction doesn't know if the human will respond in thirty seconds or three days. Tying up a worker process or a client connection for that entire window isn't viable. The agent needs to truly stop: free resources, release workers, then pick up later exactly where it left off.

Both requirements are solved by the same thing: durable execution.

  • Agents run on a managed task queue with automatic checkpointing, so any run can be retried, replayed, or resumed from the exact point of interruption.

  • Each super-step of graph execution writes a checkpoint to the persistence layer (PostgreSQL by default), keyed by a thread_id that acts as a persistent cursor into the run.

  • When a worker crashes, the run's lease is released and another worker picks it up from the latest checkpoint.

  • When an agent waits for human input, the process hands off its slot and the run sleeps indefinitely until resumed.

  • Configurable retry policies control backoff, max attempts, and which exceptions trigger retries on a per-node basis.

Durability is the foundation the rest of this list depends on. Because execution can pause and resume across process boundaries, agents can wait indefinitely for human input, run in the background, survive deploys mid-run, and handle concurrent inputs without corrupting state.
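
The resumption behavior can be sketched with an in-memory stand-in for the checkpoint store — illustrative only; the real persistence layer is PostgreSQL:

```python
# Sketch of resumption from the last completed step: each super-step writes
# a checkpoint keyed by thread_id; after a crash, a new worker replays only
# the remaining steps instead of redoing paid work.
checkpoints = {}  # thread_id -> list of completed step results

def run_steps(thread_id, steps):
    done = checkpoints.setdefault(thread_id, [])
    for step in steps[len(done):]:      # skip steps already checkpointed
        result = step()
        done.append(result)             # checkpoint after each super-step
    return done

calls = []
def expensive_step(name):
    def step():
        calls.append(name)              # stands in for tokens already paid for
        return name
    return step

steps = [expensive_step(n) for n in ("plan", "search", "synthesize")]

run_steps("thread-1", steps[:2])        # worker "crashes" after two steps
resumed = run_steps("thread-1", steps)  # new worker resumes from checkpoint
# "plan" and "search" are not re-executed; only "synthesize" runs.
```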

Memory

Agents need two different kinds of memory, and the distinction matters.

Short-term memory is what the agent accumulates within a single conversation. The messages exchanged, the tool calls made, the intermediate state built up across a run. This lives in the checkpoint for the thread, scoped to a thread_id, and disappears (conceptually) when the conversation ends. A follow-up message on the same thread sees everything that came before on that thread.

Long-term memory is what the agent carries across conversations. This can include user preferences learned across conversations, project conventions and best practices, or a knowledge base enhanced with each new query. None of this belongs to any single thread. It's user-level or organization-level context that should persist across every conversation the agent has. Checkpoints alone can't do this, because checkpoint state is scoped to a single thread.

Long-term memory is what the Agent Server's built-in store is for. It's a key-value interface where memories are organized by namespace tuples (for example, (user_id, "memories")) and persisted across threads. Your agent writes to the store in one conversation and reads from it in the next. Backed by PostgreSQL by default, it supports semantic search via embedding configuration so agents can retrieve memories by meaning rather than exact match, and you can swap in a custom backend if you need different storage characteristics. The namespace structure is flexible: scope by user, assistant, organization, or any combination that fits your data model.

Because memory that accumulates over months is some of the most valuable data the system produces, it matters where it lives. The store is queryable directly via API, and if you self-host, it lives in your own PostgreSQL instance. Keeping this data in a standard format you control is what lets you migrate between models, analyze it, or build on top of it outside the agent itself.
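
The namespace-scoped interface can be sketched as a plain dict keyed by tuples — a toy stand-in; the real store adds PostgreSQL persistence and semantic search:

```python
# Toy version of the store's key-value interface: memories organized by
# namespace tuples and shared across threads for the same user.
store = {}

def put(namespace, key, value):
    store.setdefault(namespace, {})[key] = value

def get(namespace, key):
    return store.get(namespace, {}).get(key)

# Written during one conversation...
put(("user-42", "memories"), "tone", "prefers terse answers")

# ...readable from any later thread for the same user:
tone = get(("user-42", "memories"), "tone")

# Another user's namespace is naturally isolated:
other = get(("user-7", "memories"), "tone")  # nothing stored there
```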

Multi-tenancy

The moment your agent serves more than one user, a set of problems appears that didn't exist in single-player mode. These break down into three distinct concerns, and the Agent Server handles each with its own primitive.

Isolating one user's data from another. User A's run should only touch User A's threads, and only read User A's memories. Custom authentication runs as middleware on every request: your @auth.authenticate handler validates the incoming credential and returns the user's identity and permissions, which get attached to the run context. Authorization handlers registered with @auth.on.threads, @auth.on.assistants.create, and so on then enforce who can see or modify what by tagging resources with ownership metadata on creation and returning filter dictionaries on reads. Handlers are matched from most specific to least, so you can start with a single global handler and add resource-specific ones as your model grows.

Letting the agent act on behalf of a user. Agents often need to call third-party services using the user's credentials—reading their calendar, posting to their Slack, opening a PR in their repo. Agent Auth handles the OAuth dance and token storage for this pattern, so the agent gets user-scoped credentials at runtime without you managing the refresh flow yourself. The user authenticates once; the agent can act on their behalf across subsequent runs.

Controlling who can operate the system itself. Separate from end-user access, there's the question of which members of your team can deploy agents, configure them, view traces, or change auth policies. RBAC handles this operator-level access control.

The three layers compose: end users authenticate via your auth handler, the agent calls third-party services via Agent Auth, and your team operates the deployment under RBAC policies.


Human-in-the-loop (HITL)

Agents work by running a loop: given a prompt, a model reasons and decides to call tools, observes the results, and repeats until it decides it has completed the task at hand. Most of the time you want that loop to run uninterrupted. That’s where the value comes from. But sometimes you need a human in the middle of the loop at key decision points.

There are two common situations where this comes up:

  1. Reviewing a proposed tool call. Before the agent executes a consequential action (sending an email, executing a financial transaction, deleting files), you want a human to see exactly what it's about to do and decide how to respond. Take the email case: the agent drafts a message and pauses before sending. You can approve it as-is, edit the subject or body before it goes out, or reject it with a reason and specific edit requests so the agent can revise and try again.

  2. An agent asking a clarifying question. Sometimes an agent reaches a decision point it can't resolve on its own, not because it lacks a tool but because the right answer depends on human judgment or preference. Rather than guessing, the agent can surface the question directly: "I found three config files matching that pattern. Which one should I modify?" or "Should this deploy to staging or production?" Your answer becomes the return value of the interrupt, and the agent continues from exactly where it stopped.

The Agent Server handles this with two primitives: interrupt() pauses execution and surfaces a payload to the caller; Command(resume=...) continues it with the human's response. Together they let you build approval gates, draft review loops, input validation, and any workflow where a human needs to weigh in mid-execution.


Under the hood, interrupt() triggers the runtime's checkpointer to write the full graph state to durable storage, keyed by a thread_id that acts as a persistent cursor. The process then frees resources and waits indefinitely. Unlike static breakpoints that pause before or after specific nodes, interrupt() is dynamic: place it anywhere in your code, wrap it in conditionals, or embed it inside a tool function so approval logic travels with the tool. When Command(resume=...) arrives—minutes, hours, or days later—the resume value becomes the return value of the interrupt() call, and execution picks up exactly where it stopped. Because resume accepts any JSON-serializable value, the response isn't limited to approve/reject: a reviewer can return an edited draft, a human can supply missing context, a downstream system can inject computed results. When parallel branches each call interrupt(), all pending interrupts are surfaced together and can be resumed in a single invocation, or one at a time as responses come back.
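The control-flow contract of interrupt/resume can be modeled with a Python generator: yielding pauses execution and surfaces a payload, and `send(...)` injects the human's response as the return value of the pause point. This is a sketch of the mechanism only (the real runtime also checkpoints state durably, keyed by `thread_id`, so the wait can span days):

```python
# Generator-based model of the interrupt/resume contract. The email
# review scenario from above: the agent pauses before sending, and the
# reviewer's response (approve, edit, or reject) resumes it.
def email_agent(draft: str):
    # interrupt(): pause and surface the proposed action for review.
    decision = yield {"action": "send_email", "draft": draft}
    if decision["approved"]:
        # The reviewer may return an edited draft; JSON-serializable
        # resume values aren't limited to approve/reject.
        return f"sent: {decision.get('edited', draft)}"
    return "revising based on feedback"

run = email_agent("Hi team, shipping Friday.")
pending = next(run)            # execution pauses; payload goes to the caller
print(pending["action"])       # send_email
try:
    # Command(resume=...): the reviewer edits the draft and approves.
    run.send({"approved": True, "edited": "Hi team, shipping Thursday."})
except StopIteration as done:
    print(done.value)          # sent: Hi team, shipping Thursday.
```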

Real-time interaction

Human-in-the-loop is an interaction mode where execution can pause for a person to review or provide input—sometimes immediately, sometimes much later. Separately, there are “live session” problems that show up when the agent is actively working while the user is present: making progress visible (streaming) and coordinating concurrent messages (double-texting).

Streaming

An agent that takes thirty seconds to produce a response leaves the user staring at a spinner with no signal about whether it's making progress, stuck, or about to fail. They also can't start reading the answer until the whole thing is done. Streaming solves both: partial output flows to the client as the agent produces it, so the user sees the response materialize in real time.

The Streaming API supports several modes depending on what granularity you want: full state snapshots after each graph step, state updates only, token-by-token LLM output, or custom application events. You can also combine them. Run streaming (client.runs.stream()) is scoped to a single run; thread streaming (client.threads.joinStream()) opens a long-lived connection that delivers events from every run on a thread, useful when follow-up messages, background runs, or HITL resumptions all trigger activity on the same thread.

Thread streaming supports resumption via the Last-Event-ID header: the client reconnects with the ID of the last event it received, and the server replays from there with no gaps. Without this, every dropped connection means the client either misses output or has to start over.
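The replay mechanic reduces to a monotonically increasing event ID per thread: the client acknowledges the last ID it saw, and the server replays everything after it. A toy version (hypothetical `join_stream` function; the real transport is server-sent events with a `Last-Event-ID` header):

```python
# Sketch of resumable streaming: an ordered event log per thread, and a
# reconnecting client gets a gap-free replay from its last-seen event ID.
events = [
    (1, "run started"),
    (2, "token: Hel"),
    (3, "token: lo"),
    (4, "run finished"),
]

def join_stream(last_event_id=None):
    start = 0 if last_event_id is None else last_event_id
    # Replay everything after the last event the client acknowledged.
    return [e for e in events if e[0] > start]

print(join_stream())                  # full stream from the beginning
print(join_stream(last_event_id=2))  # [(3, 'token: lo'), (4, 'run finished')]
```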

Double-texting

The second real-time problem: a user sends a new message while the agent is still working on the previous one. This happens constantly in chat UIs. Someone types a question, realizes they meant something slightly different, and fires off a correction before the first run finishes. We call this double-texting, and the runtime has to take a position on how to handle it.

There are four strategies, and the right one depends on your application:

  • enqueue (the default): The new input waits for the current run to finish, then processes sequentially.

  • reject: Refuse any new input until the current run finishes.

  • interrupt: Halt the current run, preserve progress, and process the new input from that state. Useful when the second message builds on the first.

  • rollback: Halt the current run, revert all progress including the original input, and process the new message as a fresh run. Useful when the second message replaces the first.


interrupt gives the snappiest chat feel but requires your graph to handle partial tool calls cleanly (a tool call initiated but not completed when the interrupt hits may need cleanup on resume). enqueue is the safest default—no state corruption, at the cost of making the user wait.
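The four strategies are a single dispatch decision over the current run's state. A toy dispatcher (illustrative only; a real implementation also has to cancel in-flight work safely):

```python
# Toy dispatcher for the four double-texting strategies. The "run" here
# is a dict of status, partial progress, and a pending-input queue.
def handle_double_text(strategy, current_run, new_input):
    if strategy == "enqueue":
        current_run["queue"].append(new_input)   # wait for the current run
    elif strategy == "reject":
        raise RuntimeError("run in progress; input rejected")
    elif strategy == "interrupt":
        current_run["status"] = "interrupted"    # keep partial progress
        current_run["input"] = new_input         # continue from that state
    elif strategy == "rollback":
        # Revert everything, including the original input; fresh run.
        current_run.update(status="rolled_back", progress=[], input=new_input)
    return current_run

run = {"status": "running", "input": "q1", "progress": ["step1"], "queue": []}
handle_double_text("interrupt", run, "actually, q2")
print(run["status"], run["progress"])  # interrupted ['step1']  (progress kept)
```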

Guardrails

Not every production concern can be expressed as "run the loop durably." Some have to shape the loop itself: intercepting model inputs, filtering tool outputs, enforcing limits on expensive operations. These policies belong in code, not in a prompt. They need to run every time, not whenever the model happens to remember them.

Two cases make this concrete:

  1. Redacting sensitive data before the model sees it. A customer support agent processes user messages containing PII (names, emails, account numbers). You don't want the model to see them, you don't want them in traces, and compliance likely requires redaction before logging. This has to happen before every model call, deterministically.

  2. Capping expensive operations. An agent that can call a paid external API needs a hard ceiling on how many calls it makes per run, because a confused model will otherwise happily call it fifty times and burn through your budget before lunch.

Both are handled by middleware, which wraps the agent loop at defined hooks—before_model, wrap_model_call, wrap_tool_call, after_model—so policies execute deterministically around every relevant step.
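The tool-call cap is the easiest hook to sketch. This is a hypothetical `wrap_tool_call`-style wrapper, not the LangChain middleware API itself, but it shows why the policy is deterministic: the counter lives in code, so the ceiling holds no matter what the model decides:

```python
# Minimal guardrail: a hard per-run cap on how many times an expensive
# tool may execute, enforced in code rather than in the prompt.
def with_call_limit(tool, limit):
    state = {"calls": 0}
    def wrapped(*args, **kwargs):
        if state["calls"] >= limit:
            # Return an error the model can observe, instead of letting
            # a confused loop burn through the budget.
            return {"error": f"tool call limit of {limit} reached"}
        state["calls"] += 1
        return tool(*args, **kwargs)
    return wrapped

def paid_search(query):
    return {"results": [f"hit for {query}"]}

guarded = with_call_limit(paid_search, limit=2)
print(guarded("a"))  # normal result
print(guarded("b"))  # normal result
print(guarded("c"))  # {'error': 'tool call limit of 2 reached'}
```

PII redaction follows the same shape on the `before_model` side: a deterministic transform over the message payload before every model call.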

https://docs.langchain.com/oss/python/langchain/middleware/built-in

LangChain ships built-in middleware covering the common cases: PIIRedactionMiddleware, ModelRetryMiddleware, ModelFallbackMiddleware, ToolCallLimitMiddleware, SummarizationMiddleware, HumanInTheLoopMiddleware, OpenAIModerationMiddleware, and you can write custom middleware for application-specific policies.

Middleware is open source, but it only really pays off when it runs inside the agent runtime. When it does, those same policies become part of every interaction mode the runtime supports—streaming, human-in-the-loop pauses/resumes, retries, background runs, and long-lived threads. In practice, that means your guardrails and instrumentation aren’t “best effort”: they consistently wrap every model call and every tool call, at the exact points you expect, no matter what the agent is doing.

Observability

You don't know what an agent will do in production until you run it. Unlike a traditional application where you can reason about behavior from the code, an agent's execution path depends on the model's choices at runtime: which tools to call, what to pass them, how to interpret the results, and when to give up and try something else. When something goes wrong, you can't just re-read the function. You need to see what actually happened.

A support ticket says "the agent kept asking the same question over and over." Without traces, you're guessing from the user's description. With traces, you see the full execution tree: the user's message, the model's planned response, the tool it called, the result it got back, the next message it generated, the loop it fell into. You can filter by cost to find runs that burned through tokens, by error to find runs that failed, by user to see what a specific customer experienced. You can spot patterns across thousands of runs that no individual trace would reveal.

Every LangSmith Deployment is automatically wired to a tracing project. You get the full execution tree out of the box—model calls, tool calls, subagent runs, middleware hooks—with structured metadata you can query by user, time window, cost, latency, error state, feedback, or custom tags.

Traces aren't just a debugging tool; they're the foundation of the improvement loop:


Polly, the LangSmith AI assistant, analyzes traces and surfaces insights—common failure modes, slow tool calls, repeated patterns—so you're not reading thousands of traces by hand. Online Evals run LLM-as-judge or custom scorers against production traces automatically, so regressions are caught as they happen. We used this loop to improve Deep Agents by 13.7 points on Terminal Bench 2.0 by changing only the harness—the full argument for why the agent improvement loop starts with a trace is worth reading.

Time travel

Observability tells you what happened. Time travel lets you ask what would have happened if something had gone differently.

The motivating case is debugging a run that went off the rails. Your agent made a bad decision at step 5 of a 20-step run: it called the wrong tool, misread a tool result, or asked a clarifying question when it should have kept going. You want to understand why, and you want to try alternatives without re-running the whole thing from scratch. More generally, any time an agent's path depends on state at a particular checkpoint, you want the ability to rewind to that checkpoint, change the state, and let the rest of the run unfold differently.

Because every super-step writes a checkpoint, every point in a run's history is already a snapshot you can return to. Time travel makes this explicit: pick a checkpoint from a thread's history, optionally modify its state, and resume from there. The modified checkpoint forks the thread's history. The original stays intact, and the new path runs forward as its own branch. LLM calls, tool calls, and interrupts all re-trigger on replay, so forks exercise the real agent loop rather than a stub of it.

This unlocks patterns that are hard to build otherwise: debugging why the agent chose tool A when it should have chosen tool B, comparing two prompts against the same upstream context, recovering from a run that went sideways by rewinding to the last good state, or exploring counterfactuals across many forks to understand model behavior. The LangSmith Studio UI gives you a visual interface for all of this; the API is what most production debugging workflows end up using.
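The fork mechanic itself is simple once every super-step writes a checkpoint: copy history up to the chosen checkpoint, patch its state, and replay the remaining steps on the branch. A deterministic sketch (hypothetical helpers; the real runtime re-triggers live LLM and tool calls on replay):

```python
# Sketch of checkpoint forking: one snapshot per step, fork at a chosen
# checkpoint with a state edit, replay forward; the original history
# stays intact as its own branch.
import copy

def run_agent(state, steps, history):
    for step in steps:
        state = step(state)
        history.append(copy.deepcopy(state))   # one checkpoint per super-step
    return state

def fork(history, at_index, patch):
    branch = copy.deepcopy(history[: at_index + 1])
    branch[-1].update(patch)                   # modify the chosen checkpoint
    return branch

steps = [
    lambda s: {**s, "tool": "search"},         # step 1: pick a tool
    lambda s: {**s, "answer": s["tool"]},      # step 2: act on that choice
]
history = []
run_agent({"tool": None}, steps, history)

# Counterfactual: what if the agent had chosen a different tool at step 1?
branch = fork(history, at_index=0, patch={"tool": "calculator"})
run_agent(branch[-1], steps[1:], branch)       # replay the rest on the fork
print(history[-1]["answer"], branch[-1]["answer"])  # search calculator
```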

Code execution

An agent that can only call the tools you pre-wired is limited to what you anticipated. An agent that can run arbitrary code is general-purpose: it can install dependencies, clone repos, execute tests, run data analysis, generate documents, and render plots. This is the gap between "chatbot with function calling" and "agent that can actually do things."

Arbitrary code execution requires isolation. If the agent runs rm -rf / on your host, you have a bad day. If it reads your environment variables, it exfiltrates your API keys. You need a boundary between the agent's execution environment and everything you care about, and you need it before the agent writes its first command.

In Deep Agents, isolation happens through sandbox backends. When you configure a backend that implements SandboxBackendProtocol, the agent automatically gets an execute tool for running shell commands in the sandbox alongside the standard filesystem tools. Without a sandbox backend, the execute tool isn't even visible to the agent. Supported providers include Daytona, Modal, Runloop, and LangSmith Sandboxes, and you can swap between them with a single configuration change.

LangSmith Sandboxes (currently in private preview) are worth a specific callout because they're built to integrate with the rest of the runtime. Templates define container images, resource limits, and volumes declaratively. Warm pools pre-provision sandboxes with automatic replenishment, eliminating cold start latency for interactive agents. And the auth proxy solves a problem every team hits eventually: the agent needs to call authenticated APIs, but putting credentials inside the sandbox is a security risk. The proxy runs as a sidecar, intercepts outbound requests, and injects credentials from workspace secrets automatically—the sandbox code calls api.openai.com with no headers, and the proxy adds the right Authorization header on the way out. Secrets never enter the sandbox, and the agent can't exfiltrate what it can't see.
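The auth-proxy pattern boils down to one invariant: the secret lives outside the sandbox boundary, and only the trusted egress layer ever touches it. A sketch with hypothetical names (dict-shaped "requests" standing in for real HTTP):

```python
# Sketch of the auth-proxy pattern: sandboxed code issues requests with
# no credentials, and a trusted layer outside the sandbox injects the
# secret on the way out.
SECRETS = {"api.openai.com": "sk-workspace-secret"}  # lives outside the sandbox

def proxy_outbound(request):
    host = request["host"]
    headers = dict(request.get("headers", {}))       # copy; never mutate input
    if host in SECRETS:
        headers["Authorization"] = f"Bearer {SECRETS[host]}"
    return {**request, "headers": headers}

# Inside the sandbox, the agent's code never holds the key:
sandbox_request = {"host": "api.openai.com", "path": "/v1/responses", "headers": {}}
outbound = proxy_outbound(sandbox_request)
print("Authorization" in sandbox_request["headers"])          # False
print(outbound["headers"]["Authorization"].startswith("Bearer"))  # True
```

Because the sandbox-side request object is never given the header, a compromised agent has nothing to exfiltrate.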


One piece of security guidance worth repeating: sandboxes protect your host, not the sandbox itself. An attacker who controls the agent's input (via prompt injection in a scraped webpage, a malicious email, a poisoned tool result) can instruct the agent to run commands inside the sandbox. The sandbox keeps the attacker off your machine, but anything inside the sandbox—including credentials placed there directly—is compromised. The auth proxy pattern exists for exactly this reason.

Integrations

Agents are most useful when they plug into the systems people and organizations already use. A coding agent becomes more powerful when it can reach into GitHub, Linear, and your CI system. A research agent becomes more useful when its output feeds into your publishing pipeline. An internal agent becomes a platform when other agents can call it as a building block. If every one of those integrations is a hand-rolled adapter, your agents stay isolated. The boundary between "agent" and "everything else" becomes a wall.

Open protocols solve this by letting agents and external systems discover and talk to each other without either side knowing the other's implementation. The Agent Server provisions three integration surfaces automatically.

MCP

MCP (Model Context Protocol) is the open standard for connecting agents to tools and data sources. Every LangSmith Deployment automatically exposes an MCP endpoint, making your agent discoverable by any MCP-compliant client—Claude Desktop, IDEs, other agents, custom applications—without you writing adapter code. In the other direction, your agent can call out to any MCP server (Linear, GitHub, Notion, and hundreds of others) to reach tools and data your users already have.

A2A

A2A (Agent-to-Agent) is the analogous standard for agent-to-agent communication, and every deployment exposes an A2A endpoint automatically as well. This is what makes multi-agent architectures across deployments tractable: an orchestrator agent in one deployment can discover and call worker agents in another using a protocol both sides understand, with no hand-rolled HTTP contracts.

Webhooks

Webhooks handle the outbound case: your agent finishes a run, and you want to kick off something downstream without polling. Pass a webhook URL when creating a run, and the server POSTs the run payload to that URL on completion. This is how you chain agent runs into existing workflows—a research run completes and triggers a publishing pipeline, a daily summary finishes and notifies Slack, a compliance check completes and writes to your audit log. Headers, domain allowlists, and HTTPS enforcement are all configurable for production environments.
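The contract is fire-and-forget: complete the run, POST the payload, move on. A self-contained sketch with the HTTP transport stubbed out (hypothetical names; a real server also enforces HTTPS and domain allowlists as noted above):

```python
# Sketch of run-completion webhooks: when a run finishes, the server
# POSTs the run payload to the registered URL.
posted = []

def post(url, payload):
    # Stand-in for an HTTP POST so the example is self-contained.
    posted.append((url, payload))

def execute_run(input_text, webhook=None):
    result = {"status": "success", "output": input_text.upper()}
    if webhook:
        post(webhook, result)    # notify downstream without polling
    return result

execute_run("summarize q3", webhook="https://example.com/hooks/publish")
print(posted[0][0])              # https://example.com/hooks/publish
```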

Cron

The agents we've been talking about so far are reactive: a user sends a message, the agent responds. But a lot of valuable agent work is proactive—it happens on a schedule, with no human triggering it.

Two patterns in particular:

  1. Sleep-time compute. Agents that do useful work during idle periods, so users benefit from accumulated thinking rather than on-demand latency. A research agent that runs nightly to catch up on new papers in your field. A prep agent that reviews tomorrow's calendar and drafts briefing notes before you start your day. A triage agent that classifies overnight support tickets so your team walks into a prioritized queue. The work happens while nobody's waiting, and the output is ready when the user shows up.

  2. Health and monitoring loops. Agents that periodically check on something and act (or escalate) if they find an issue. An on-call agent that reviews alerts every fifteen minutes, an agent that monitors your staging environment for regressions, a compliance agent that sweeps for policy violations on a cadence. These need the same durability, tracing, and auth as user-facing runs, but no user is waiting on them.

The Agent Server has cron jobs built in, so scheduled runs get the same durability, tracing, and auth guarantees as any other run—no separate scheduler to maintain, no second observability story to wire up. You pass a standard cron expression (UTC) and an input, and the server triggers runs on schedule.

Two flavors fit different patterns:

  1. Stateful cron (client.crons.create_for_thread) ties the schedule to a specific thread_id, so every triggered run appends to the same conversation. This fits agents that should see their own history—a daily research agent that builds on yesterday's findings, or a monitoring agent that remembers what it already flagged.

  2. Stateless cron (client.crons.create) spins up a fresh thread for each execution, which fits batch-style work that doesn't need continuity between runs. Control thread cleanup via on_run_completed: "delete" (the default) removes the thread when the run finishes, "keep" preserves it for later retrieval via client.runs.search(metadata={"cron_id": cron_id}).
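The difference between the two flavors is just where each triggered run's thread comes from. A toy model (hypothetical function names, not the `client.crons` API):

```python
# Toy model of the two cron flavors: stateful appends every triggered
# run to one thread; stateless creates a fresh thread per execution and
# deletes it unless on_run_completed="keep".
import itertools

threads = {}
_ids = itertools.count(1)

def _trigger(thread_id, message):
    threads.setdefault(thread_id, []).append(message)

def stateful_cron_tick(thread_id, message):
    _trigger(thread_id, message)          # same thread every time: history accumulates

def stateless_cron_tick(message, on_run_completed="delete"):
    tid = f"t{next(_ids)}"                # fresh thread per execution
    _trigger(tid, message)
    if on_run_completed == "delete":
        del threads[tid]                  # no continuity between runs
    return tid

stateful_cron_tick("daily-research", "Mon digest")
stateful_cron_tick("daily-research", "Tue digest")
kept = stateless_cron_tick("batch sweep", on_run_completed="keep")
print(len(threads["daily-research"]), kept in threads)  # 2 True
```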


Every cron run shows up in tracing, respects auth handlers and middleware, and supports resumption on failure—a cron that hits a transient model outage at 3am doesn't silently fail, it gets retried like any other run. One operational note: delete crons when you're done with them. They keep running (and billing) until you do.

We see enterprise teams with varying deployment requirements, so the runtime supports cloud, hybrid, and self-hosted deployments. The capabilities are the same regardless of where you run it.

deepagents deploy

deepagents deploy is the packaging step that deploys your agent on the runtime described above. You define your agent in deepagents.toml, and the CLI bundles your configuration and deploys it as a LangSmith Deployment with all of the aforementioned features.


Memory uses a virtual filesystem with pluggable backends, giving agents both ephemeral scratch space and persistent cross-conversation storage. Deep Agents support memory scoped to users or assistants (or both)!

Sandbox providers (LangSmith Sandboxes, Daytona, Modal, Runloop, or custom) are a single config value. When a sandbox is present, the harness automatically adds an execute tool. Sandbox lifecycle (thread-scoped vs assistant-scoped) is handled through graph factories. Credentials inside sandboxes are managed through the sandbox auth proxy so API keys never appear in sandbox code or logs.

Skills and instructions are auto-detected from your skills/ directory and AGENTS.md. MCP servers are picked up from mcp.json. The name field is the only required config value; everything else has sensible defaults.

The result is a deployment that can evolve over time, with new skills, tools, and memory policies, without a full rewrite. For the complete set of production considerations (credential management, async patterns, frontend integration, and more), see the going-to-production guide.

Open Harness

There's a growing trend in agent infrastructure where moving to a managed solution comes with reduced builder choice—lock-in to a single model provider, a closed harness, or harness functionality hidden behind APIs (like server-side compaction that generates encrypted summaries you can't use outside one ecosystem). The practical consequence is that teams lose visibility into how their agent actually works, and lose the ability to change it when it doesn't.

One note on vendor lock-in: deepagents deploy is built to avoid it. The harness is MIT licensed and fully open source, agent instructions use AGENTS.md (an open standard), and agents are exposed via open protocols—MCP, A2A, Agent Protocol. There's no model or sandbox lock-in, and nothing about the harness is a black box.


Additionally, Deep Agents allows you to inspect, customize, and extend every layer of agent behavior, including rate limits, retry logic, model fallback, PII detection, and file permissions via LangChain's middleware.

Take your agents to production

The capabilities this guide outlines—durable execution, memory, multi-tenancy, guardrails, human-in-the-loop, observability, sandboxed code execution, scheduled runs, and more—are the infrastructure requirements production agents can't function without. deepagents deploy packages all of it so teams don't have to assemble it from scratch, and keeps the stack open, configurable, and yours throughout.

Building agents is a deeply iterative cycle: traces surface what's actually happening in production, online evals catch regressions before they compound, and memory means the agent gets more useful over time. The infrastructure isn't just supporting the live agent, it's the foundation for making it better.

If you want to try it out, the quickstart will get you from deepagents.toml to a running deployment in minutes. For the full production playbook including memory scoping, sandbox lifecycle, credential management, guardrails, and frontend integration, see the going-to-production guide. For a deeper look at the runtime itself, see the LangSmith Deployment and Agent Server docs.
