返回列表
🧠 阿头学 · 💬 讨论题

智能体开发生命周期不是锦上添花,而是从 Demo 到生产的分水岭

这篇文章的核心判断是对的:智能体真正难的不是做出来,而是建立“构建—测试—部署—监控”的闭环;但它明显借方法论在推 LangChain 生态,通用框架有价值,具体产品绑定则带营销色彩。
打开原文 ↗

2026-05-10 原文链接 ↗
阅读简报
双语对照
完整翻译
原文
讨论归档

核心观点

  • 闭环比首版更重要 文章最站得住的判断是,智能体团队的竞争力不在于首个 demo 多惊艳,而在于能否把生产中的失败持续沉淀成测试集、评估和下一版改进;这个判断比“多智能体”“自动化革命”之类口号扎实得多。
  • 测试必须早于上线 作者坚持“先有基础 eval 再部署”是正确的,因为没有预先评估就上线,本质上是在拿用户做盲测;尤其智能体涉及工具调用、状态变化和多轮交互时,单靠线上反馈补救通常代价更高。
  • Trace 是智能体时代的一等基础设施 文章准确指出,传统监控只看延迟、报错和可用性已经不够,智能体即使技术上成功返回,也可能业务上失败;因此 trace 不是可选日志,而是定位错误工具调用、错误上下文和错误流程的必要条件。
  • 部署不是“扔到服务器上” 对长时运行、可暂停恢复、需要人工审批、可能写文件或执行代码的智能体而言,运行时、沙箱和上下文管理确实比传统 Web 部署更关键;如果还用无状态服务思维理解 agent,产品很容易在真实场景里崩。
  • 治理是规模化后的硬门槛 成本、权限、审计、human-in-the-loop 和资产复用不是大公司才有的官僚问题,而是多团队、多智能体之后必然出现的失控点;文章在这点上是现实主义的,但对安全与合规的展开仍然不够。

跟我们的关联

  • 对 ATou 意味着什么、下一步怎么用 这意味着做 agent 不该再把“prompt 调通”当完成,而该先定义最小闭环:任务数据集、基础 eval、trace 采集、上线后反馈回流;下一步可以用这套框架审视手头项目,先补“监控和评估”短板,而不是继续堆功能。
  • 对 Neta 意味着什么、下一步怎么用 这意味着判断一个团队是否真有 agent 能力,不能只看 demo 和模型选型,而要看它有没有持续学习系统;下一步可以把“是否有 trace→eval→dataset 回流机制”作为技术尽调或合作判断的核心问题。
  • 对 Uota 意味着什么、下一步怎么用 这意味着“智能体体验差”很多时候不是模型笨,而是产品没有运行时、上下文管理和人工兜底设计;下一步讨论产品方案时,应该先问失败怎么发现、怎么暂停、怎么纠正,而不是先问能不能全自动。
  • 对三者共同意味着什么、下一步怎么用 这篇文章最有用的不是推荐具体工具,而是提供了一个检查表:构建层、测试层、部署层、监控层、治理层是否各自有人负责;下一步最有效的动作是拿这个检查表给现有项目做一次残酷体检。

讨论引子

1. 对大多数团队来说,真正必要的是“智能体开发生命周期”,还是一个被过度包装的 AgentOps 销售叙事? 2. Trace 和 eval 到底能解决多少问题,哪些失败其实来自底层模型能力不足、目标定义模糊或外部系统不稳定? 3. 对中小团队而言,什么时候该上运行时、沙箱、治理平台,什么时候这些基建反而会拖慢速度?

每个人都想把智能体真正交付上线。

最优秀的组织已经摸清了如何反复地、安全地、系统化地做到这一点。它们尽早发布,从真实使用中学习,并快速迭代。它们不会把智能体当成一次性的演示,或彼此孤立的项目。

相反,它们建立了一套智能体开发生命周期,把实验转化为一套可重复的系统,持续完成交付、学习和改进,由此形成前进的势能。

这个生命周期分为四个部分。

构建 → 测试 → 部署 → 监控

https://www.langchain.com/blog/traces-start-agent-improvement-loop

这个顺序是有意安排的。测试应该在智能体进入生产环境之前就开始,而不是之后。团队需要先在部署前测试智能体,再以可控方式部署它们,监控它们在生产环境中的行为,并把这些经验反馈到下一轮构建和评估周期中。

对于单个智能体,这个流程可以保持轻量。可一旦扩展到许多智能体,它就会变成基础设施和治理层面的挑战。团队需要共享的方法来控制成本、管理工具访问、检查工具调用、复用上下文,并决定哪些地方需要人类介入。

让一个智能体只成功运行一次,和把开发智能体变成一种可重复实践,这两者之间的差别,就在于是否具备正确的开发生命周期。

构建

在构建阶段,团队要决定自己在打造什么样的智能体系统,以及希望采用哪一层抽象。

这里的工具形态非常丰富。有些工具以代码为先,有些则是无代码或低代码。有些聚焦抽象层,有些则聚焦为智能体提供可工作的环境,包括提示词、工具、技能和状态。

https://docs.langchain.com/langgraph-platform

在代码优先这一侧,团队通常会选择开源框架和执行支撑工具。在 LangChain 生态中,这包括 LangChain、LangGraph 和 Deep Agents。LangChain 之外的例子则包括 CrewAI 和 Claude Agents SDK。

这些工具运行在技术栈的不同层级。

智能体框架主要关注抽象。它们帮助开发者编排模型调用、工具、提示词、检索、结构化输出和智能体循环。LangChain 和 CrewAI 就属于这一类。

智能体运行时主要关注执行。它们支持那些需要状态、控制流、持久性和人工干预的智能体。LangGraph 是 LangChain 生态里最清晰的例子。它提供了一种方式来构建具备智能体特征的系统,这些系统可以分支、循环、暂停、恢复,并持续持久化状态。

智能体执行支架主要关注把事情做成。它们为智能体处理长时任务提供外围结构,包括提示词、技能、MCP servers、hooks、middleware,有时还包括文件系统。Deep Agents 和 Claude Agent SDK 就属于这种模式。

这些区别很重要,因为构建一个智能体这件事本身可以有不同含义。

对于一个简单应用,这可能只意味着定义一个工具调用循环。对于更复杂的智能体,这可能意味着编写提示词、定义技能、连接 MCP servers、配置 middleware,并设置智能体能够持续检索或更新的上下文。

无代码构建

构建阶段也有无代码和低代码的一面。像 LangSmith Fleet、Claude Cowork 和 n8n 这样的工具,让更多人可以参与智能体开发。这很重要,因为真正理解所需工作流的人,并不总是写代码的人。

与此同时,无代码工具并不会消除工程控制的必要性。随着系统变得更复杂,团队通常还是需要用代码扩展或覆盖行为。hooks 和 middleware 在这里尤其重要,因为它们让团队可以围绕工具调用、上下文处理、审批、鉴权或业务规则添加自定义逻辑,而不必从头重建每一个智能体。

最好的构建环境,会让简单的事情足够简单,也让复杂的事情成为可能。它们允许领域专家编辑提示词、技能和上下文,同时仍然把那些必须可靠、可测试、可治理的部分交给工程师控制。

测试

在智能体部署之前,团队需要有办法判断它是否真的已经准备好了。

这并不意味着在任何人使用智能体之前,就先构建一套完美的评估体系。现实里,这几乎从来不切实际。但这确实意味着,要先有足够的评估,用来捕捉明显失败、比较不同版本,并避免盲目发布变更。

大多数评估流程,都是从一小批具有代表性的任务数据集开始。有些样本来自预期用例,有些则来自人工测试、内部试用、支持工单、历史 trace,或已知边界情况。随着时间推移,生产环境中的 traces 会让这些数据集变得更强,但测试应该在上线前就开始。

数据集与指标

数据集,是团队保存所学经验的方式。没有数据集,同样的失败往往会在提示词变更、模型升级或工具更新之后再次出现。

正确的指标取决于任务本身。

有些情况下,存在明确的标准答案。智能体有没有提取出正确的值。有没有选择正确的标签。有没有更新正确的字段。这类任务可以直接按正确性衡量。

还有些时候,并不存在唯一的标准答案。智能体可能需要撰写回复、总结对话、判断是否该升级处理,或通过多条都合理的路径完成任务。在这些情况下,团队会更多依赖基于标准的评估。问题会变成,回复是否有依据,智能体是否遵守策略,是否提出了澄清问题,或者是否在没有多余工具调用的情况下高效完成了任务。

实验

实验,是把数据集和指标真正连接到迭代上的环节。它让团队可以在同一套评估集上,比较提示词、模型、检索策略、工具 schema 和编排模式。随着时间推移,这些实验会显示出智能体是在进步,还是在退化。

目标不是在第一天就做出一套完美的评估体系。目标是先从一套有用的体系开始,然后持续改进它。最有价值的评估数据集,往往来自最难的样本,先来自开发阶段和内部试用,之后再来自生产环境。

模拟

模拟也是测试中的另一个重要部分。

许多智能体都是多轮系统。它们不只是回答一个问题,而是要展开对话、收集信息、调用工具、更新状态,并从模糊情况中恢复。对于这些智能体来说,单轮评估远远不够。团队需要多轮评估,以及模拟的端到端交互。

语音智能体是一个显而易见的例子,但这种模式远不止于此。任何跨多个轮次运行的智能体,都可能需要模拟。一个客服智能体可能需要应对情绪激动的客户,提出追问,检查订单状态,并判断是否有必要升级处理。一个编程智能体可能需要检查代码仓库、做出修改、运行测试,并根据反馈作出响应。一个内部运营智能体可能需要在执行操作前先收集缺失信息。

良好的测试实践,能帮助团队以系统化方式改进智能体,而不是依赖感觉。它把预期行为变成数据集,把数据集变成实验,再把实验变成更好的系统版本。部署之后,监控又会提供真实世界里的样本,让这些评估变得更强。

部署

一旦智能体被构建并完成评估,它就需要一个能够稳定运行的环境。

对于简单智能体,部署可能看起来和传统应用部署差不多。但许多智能体需要的不只是一个无状态服务器。它们会持续运行较长时间、调用工具、等待人工输入、写入文件、从中断中恢复,并在多次交互或任务之间维持状态。

这就是为什么运行时很重要。

一个生产级智能体运行时,通常需要支持持久执行和 human-in-the-loop 模式。持久执行意味着智能体可以检查点保存进度,并在出现故障时恢复,而不是丢失工作。human-in-the-loop 意味着智能体在需要审批、澄清或审查时可以暂停。

这方面已经有现成方案。LangSmith Deployment 提供了部署和管理 Deep Agents 与 LangGraph agents 的基础设施。AWS AgentCore 是另一个托管式智能体运行时的例子。还有一些团队会在 Temporal 这样的系统上自己构建运行时,尤其是当它们本来就已经在技术栈的其他地方用 Temporal 处理长时工作流时。

沙箱

很多智能体还需要专用的执行环境。

智能体越来越常需要编写代码、执行代码、检查文件、转换文档,或与文件系统交互。在这些情况下,团队需要决定这些工作在哪里发生。沙箱是一种常见方案。它们提供带有文件系统访问能力的隔离执行环境,同时降低错误或不安全行为带来的影响范围。

例子包括 LangSmith Sandboxes、Daytona 和 E2B。

并不是每个智能体都需要完整沙箱。在某些情况下,智能体只需要一个可以存取文件的地方。一个虚拟文件系统就够了。Deep Agents 支持这种模式,让智能体把文件当成工作记忆来使用,而不一定要在沙箱里执行任意代码。在底层,这个文件系统可能由 Postgres 或 S3 之类的系统支撑。

上下文中心

部署中另一个常被忽视的部分,是提示词和上下文的管理。

智能体最重要的一些部分,并不是传统意义上的应用代码。提示词、检索上下文、技能和任务指令,可能比应用本身更频繁地变化。它们也可能需要由不是工程师的人来编辑。

这就带来了提示词中心或上下文中心的需求,也就是一个用来存储、版本化、审查和更新智能体非代码部分的地方。这样一来,团队无需完整部署就能调整智能体行为,也能让领域专家负责他们最了解的上下文。

在实际中,部署不只是把智能体放到服务器上。它更是在为智能体配齐它完成真实工作所需要的运行时、执行环境和上下文管理系统。

监控

一旦智能体部署完成,团队就需要看清它在生产环境里到底是怎么表现的。

这正是监控智能体和监控传统软件不同的地方。延迟、成本、错误率和可用性这些指标仍然重要,但它们只是一部分。智能体即便返回了一个技术上成功的响应,仍然可能在任务本身上失败。它可能调用了错误的工具,依赖了错误的上下文,跳过了必要的审批步骤,或者给出一个听起来合理但其实错误的答案。

要理解这些失败,团队就需要 traces。

一条 trace 记录了智能体的完整轨迹,包括它接收到的输入、发起的模型调用、调用的工具、收到的输出,以及最终产出的响应或动作。这才是理解智能体实际做了什么所需要的细节层级。

这也是为什么我们一直认为,智能体可观测性是智能体评估的动力来源,也为什么智能体改进循环是从一条 trace 开始的。如果你看不到这条轨迹,就无法可靠地调试它的行为,也无法把这些失败转化成未来的评估样本。

信号

监控还应该包括从这些 traces 中提取信号。

其中一些信号可以来自 LLM-as-judge evaluators。比如,一个 judge 可以给出评分,判断智能体是否回答了用户问题、是否遵守策略、是否使用了合适语气、是否完成了任务。另一些信号则可以更简单。一个 regex 就可以捕捉是否出现了必需短语,是否调用了被禁止的工具,或者是否出现了某种已知失败模式。

这些信号的用途不只是质量检查。它们还可以成为一种产品分析形式。它们可以告诉你,用户在让智能体做哪些任务,智能体卡在了哪里,用户多久会纠正它一次,以及用户会在哪些地方感知到错误。

反馈

反馈也是监控中的另一个核心部分。

光保存 traces 还不够。团队还需要把反馈和这些 traces 一起保存。反馈可以来自 LLM judges、基于 regex 的信号、人工审查员,或通过 API 收集的直接用户反馈。比如在 LangSmith 中,团队可以把用户反馈直接附加到底层 run 上,这样就更容易把用户不满意,和智能体在前三步用了错误工具这样的事实关联起来。

仪表盘

最后,团队还需要仪表盘和告警,用来展示随时间变化的趋势。

一个有用的智能体仪表盘,会跟踪使用量、反馈、延迟、成本、工具调用、评估器得分,以及反复出现的失败模式等指标。告警则应该在重要阈值被突破时触发,比如延迟上升、成本增加、工具失效、用户反馈下降,或策略违规激增。

好的监控,不只是知道系统是否在线。它更是在理解智能体是否在做正确的事,是否以正确的方式在做,并且是否在持续变好。

最强的监控系统,会直接把结果反馈回评估。重要的 traces 会变成数据集样本,反复出现的失败会变成指标,而生产环境中的行为则会成为下一轮改进的基础。

迭代

最优秀的组织,会快速而系统地跑完智能体开发生命周期。

它们不会等到智能体完美才发布。相反,它们会先构建出有用的东西,做足够的测试来理解它的行为,以可控方式部署它,监控它在生产环境中的表现,再把这些经验反馈到下一个版本。

这并不意味着草率发布。关键在于可见性。

拥有数据集、实验、tracing、反馈和仪表盘的团队,可以直接从真实使用中学习。它们可以在大范围推广之前测试变更,找出生产环境里哪里坏了,把失败转成评估,再在不靠猜测的情况下改进智能体。

这就是团队如何一步步爬坡,也是在这个过程中,智能体系统如何持续进步。

最有效的团队,会找出那些困难样本,理解智能体为什么失败,然后调整提示词、工具配置、检索策略、模型、middleware 或工作流。它们会重新运行 evals,部署更好的版本,而监控会继续给它们带来下一批边界情况和失败样本。

在企业内部,挑战在于如何让这个循环在不同团队之间也能重复运行。

如果每个团队都得从零构建自己的评估框架、部署基础设施、tracing 系统、反馈管道和仪表盘,智能体开发就会推进得很慢。最有效的组织会投资共享基础设施,让团队能够穿过整个生命周期,而不用反复重造底层系统。

这正是智能体开发生命周期之所以成为一种运营实践的原因。

治理

治理覆盖整个智能体开发生命周期。

对于单个智能体,轻量控制可能就够了。随着组织部署越来越多的智能体,治理就变得必要。没有治理,团队很快就会面对一些难以发现、难以监控、运行昂贵,而且权限边界不清的智能体。

https://www.langchain.com/langgraph

成本

第一个治理挑战是成本。

智能体可能会变得昂贵,因为它可能涉及多次模型调用、超长上下文窗口、反复使用工具、重试,或者运行很长时间。组织需要通过预算、使用监控、告警,以及对究竟是哪些智能体、团队、模型或工具在推高成本的可见性,来追踪和管理这部分支出。

工具访问

第二个治理挑战是工具访问。

智能体之所以有用,是因为它们能采取行动,但这也引入了风险。团队需要围绕智能体可以访问哪些工具、在什么条件下访问、以及代表哪些用户访问,建立清晰的控制机制。

这正是审计轨迹变得重要的地方。如果一个智能体调用了工具,组织应该能够检查到底是哪个智能体发起了调用,它使用了什么输入,产出了什么输出,以及是哪位用户或哪条策略授权了这次操作。工具调用往往是智能体行为真正影响业务的地方,因此它们必须可观测、可审查。

human-in-the-loop 也是另一种重要的治理机制。

并不是每一次工具调用都应该完全自动化。有些操作应该暂停,等待人工审查,尤其是当它们涉及客户、财务系统、敏感数据或生产基础设施时。human-in-the-loop 工作流,只有从系统一开始设计时就被纳入,效果才最好。

可发现性

第三个治理挑战是可发现性与复用。

随着组织构建越来越多的智能体,也会积累越来越多可复用资产,比如提示词、技能、工具、检索源、策略,甚至其他智能体。没有良好的发现和治理机制,团队往往会反复重建这些组件,最终导致不一致。共享上下文和共享智能体,必须能被找到、被复用,并且被治理。

这一点对技能尤其重要。一个技能可以编码一个工作流、一种写作风格、一个领域专用流程,或者一套使用工具的指令。如果某个团队已经做出了一个好技能,另一个团队就应该能够找到它,而不是从零重写一个新版本。

好的治理,不是为了拖慢团队速度。它是为了让快速迭代在系统规模扩大时,依然不会失去可见性、控制力和一致性。

结论

最优秀的组织已经开始这样运作了。它们会尽早发布,但不会盲目发布。它们会在部署前评估,在部署后监控行为,并持续利用所学经验让下一个版本变得更好。

这正是智能体开发之所以能够被重复执行的原因。也正因如此,智能体才能从演示走向可靠的生产系统。

链接 http://x.com/i/article/2053156837453959169

Everyone wants to ship agents.

The best organizations have figured out how to do it repeatedly, safely, and systematically. They ship early, learn from real usage, and iterate quickly. They don’t treat agents as one-off demos or isolated projects.

Instead, they’ve built an agent development lifecycle that creates momentum by turning experimentation into a repeatable system for shipping, learning, and improving over time.

That lifecycle has four parts:

Build → Test → Deploy → Monitor

https://www.langchain.com/blog/traces-start-agent-improvement-loop

The order is intentional. Testing should start before an agent reaches production, not after. Teams need to test the agents before deployment, deploy them in a controlled way, monitor how they behave in production, and feed those learnings back into the next build and evaluation cycle.

For a single agent, this process can stay lightweight. Across many agents, it becomes an infrastructure and governance challenge. Teams need shared ways to control cost, manage tool access, inspect tool calls, reuse context, and decide where humans need to be involved.

The difference between getting an agent to work once to building agents as a repeatable practice comes from having the right development lifecycle in place.

每个人都想把智能体真正交付上线。

最优秀的组织已经摸清了如何反复地、安全地、系统化地做到这一点。它们尽早发布,从真实使用中学习,并快速迭代。它们不会把智能体当成一次性的演示,或彼此孤立的项目。

相反,它们建立了一套智能体开发生命周期,把实验转化为一套可重复的系统,持续完成交付、学习和改进,由此形成前进的势能。

这个生命周期分为四个部分。

构建 → 测试 → 部署 → 监控

https://www.langchain.com/blog/traces-start-agent-improvement-loop

这个顺序是有意安排的。测试应该在智能体进入生产环境之前就开始,而不是之后。团队需要先在部署前测试智能体,再以可控方式部署它们,监控它们在生产环境中的行为,并把这些经验反馈到下一轮构建和评估周期中。

对于单个智能体,这个流程可以保持轻量。可一旦扩展到许多智能体,它就会变成基础设施和治理层面的挑战。团队需要共享的方法来控制成本、管理工具访问、检查工具调用、复用上下文,并决定哪些地方需要人类介入。

让一个智能体只成功运行一次,和把开发智能体变成一种可重复实践,这两者之间的差别,就在于是否具备正确的开发生命周期。

Build

The build phase is where teams decide what kind of agent system they are creating and what level of abstraction they want to use.

There is a wide range of tooling here. Some tools are code-first, while others are no-code or low-code. Some focus on abstractions, while others focus on giving agents a working environment with prompts, tools, skills, and state.

https://docs.langchain.com/langgraph-platform

On the code-first side, teams often reach for open-source frameworks and harnesses. In the LangChain ecosystem, that includes LangChain, LangGraph, and Deep Agents. Outside of LangChain, examples include CrewAI and Claude Agents SDK.

These tools operate at different layers of the stack.

Agent frameworks focus primarily on abstractions. They help developers compose model calls, tools, prompts, retrieval, structured outputs, and agent loops. LangChain and CrewAI are examples in this category.

Agent runtimes focus on execution. They support agents that need state, control flow, durability, and human intervention. LangGraph is the clearest example in the LangChain ecosystem. It gives you a way to build agentic systems that can branch, loop, pause, resume, and persist state over time.

Agent harnesses focus on doing. They provide the surrounding structure agents need for longer-running tasks: prompts, skills, MCP servers, hooks, middleware, and sometimes a filesystem. Deep Agents and the Claude Agent SDK are examples of this pattern.

These distinctions matter because “building an agent” can mean different things.

For a simple application, it may only involve defining a tool-calling loop. For a more sophisticated agent, it may involve writing prompts, defining skills, connecting MCP servers, configuring middleware, and setting up context the agent can retrieve or update over time.

No-code building

There is also a no-code and low-code side of the build phase. Tools like LangSmith Fleet, Claude Cowork, and n8n allow more people to participate in agent development. That matters because the person who understands the workflow needed is not always the person who writes the code.

At the same time, no-code tools do not eliminate the need for engineering control. As systems become more complex, teams usually need ways to extend or override behavior in code. Hooks and middleware are especially important here because they allow teams to add custom logic around tool calls, context handling, approvals, auth, or business rules without rebuilding every agent from scratch.

The best build environments make simple things simple and complex things possible. They let domain experts edit prompts, skills, and context, while still giving engineers control over the parts that need to be reliable, testable, and governed.

构建

在构建阶段,团队要决定自己在打造什么样的智能体系统,以及希望采用哪一层抽象。

这里的工具形态非常丰富。有些工具以代码为先,有些则是无代码或低代码。有些聚焦抽象层,有些则聚焦为智能体提供可工作的环境,包括提示词、工具、技能和状态。

https://docs.langchain.com/langgraph-platform

在代码优先这一侧,团队通常会选择开源框架和执行支撑工具。在 LangChain 生态中,这包括 LangChain、LangGraph 和 Deep Agents。LangChain 之外的例子则包括 CrewAI 和 Claude Agents SDK。

这些工具运行在技术栈的不同层级。

智能体框架主要关注抽象。它们帮助开发者编排模型调用、工具、提示词、检索、结构化输出和智能体循环。LangChain 和 CrewAI 就属于这一类。

智能体运行时主要关注执行。它们支持那些需要状态、控制流、持久性和人工干预的智能体。LangGraph 是 LangChain 生态里最清晰的例子。它提供了一种方式来构建具备智能体特征的系统,这些系统可以分支、循环、暂停、恢复,并持续持久化状态。

智能体执行支架主要关注把事情做成。它们为智能体处理长时任务提供外围结构,包括提示词、技能、MCP servers、hooks、middleware,有时还包括文件系统。Deep Agents 和 Claude Agent SDK 就属于这种模式。

这些区别很重要,因为构建一个智能体这件事本身可以有不同含义。

对于一个简单应用,这可能只意味着定义一个工具调用循环。对于更复杂的智能体,这可能意味着编写提示词、定义技能、连接 MCP servers、配置 middleware,并设置智能体能够持续检索或更新的上下文。

无代码构建

构建阶段也有无代码和低代码的一面。像 LangSmith Fleet、Claude Cowork 和 n8n 这样的工具,让更多人可以参与智能体开发。这很重要,因为真正理解所需工作流的人,并不总是写代码的人。

与此同时,无代码工具并不会消除工程控制的必要性。随着系统变得更复杂,团队通常还是需要用代码扩展或覆盖行为。hooks 和 middleware 在这里尤其重要,因为它们让团队可以围绕工具调用、上下文处理、审批、鉴权或业务规则添加自定义逻辑,而不必从头重建每一个智能体。

最好的构建环境,会让简单的事情足够简单,也让复杂的事情成为可能。它们允许领域专家编辑提示词、技能和上下文,同时仍然把那些必须可靠、可测试、可治理的部分交给工程师控制。

Test

Before an agent is deployed, teams need a way to determine whether it is actually ready.

That does not mean building a perfect eval suite before anyone uses the agent. In practice, that is rarely realistic. It does mean having enough evals in place to catch obvious failures, compare versions, and avoid shipping changes blindly.

Most eval workflows start with a small dataset of representative tasks. Some examples come from expected use cases, while others come from manual testing, dogfooding, support tickets, prior traces, or known edge cases. Over time, production traces make these datasets much stronger, but testing should start before production.

Datasets and metrics

Datasets are how teams preserve what they learn. Without them, the same failures tend to reappear after prompt changes, model upgrades, or tool updates.

The right metrics depend on the task.

In some cases, there is a clear ground truth answer. Did the agent extract the right value? Did it choose the right label? Did it update the right field? These tasks can be measured directly for correctness.

Other times, there is no single ground truth answer. An agent may need to write a response, summarize a conversation, decide whether to escalate, or complete a task with many valid paths. In those cases, teams rely more on criteria-based evaluation. The questions become whether the response was grounded, whether the agent followed policy, whether it asked for clarification, or whether it completed the task efficiently without unnecessary tool calls.

Experiments

Experiments are what connect datasets and metrics to iteration. They allow teams to compare prompts, models,retrieval strategies, tool schemas, and orchestration patterns against the same evaluation set. . Over time, these experiments show whether the agent is improving or regressing.

The goal is not to create a perfect eval suite on day one. The goal is to start with a useful one and continuously improve it. The most valuable eval datasets are built from the hardest examples: first from development and dogfooding, then later from production.

Simulations

Simulation is another important part of testing.

Many agents are multi-turn systems. They do not just answer one question; they have a conversation, gather information, call tools, update state, and recover from ambiguity. For those agents, single-turn evals are not enough. Teams need multi-turn evals and simulated end-to-end interactions.

Voice agents are an obvious example, but the pattern is broader. Any agent that operates over a sequence of turns may need simulation. A support agent may need to handle a frustrated customer, ask follow-up questions, check order status, and decide whether escalation is necessary. A coding agent may need to inspect a repository, make changes, run tests, and respond to feedback. An internal operations agent may need to gather missing information before taking action.

Good testing practices help teams improve agents systematically without relying on vibes. They turn expected behavior into datasets, datasets into experiments, and experiments into better versions of the system. After deployment, monitoring supplies the real-world examples that make those evals stronger.

测试

在智能体部署之前,团队需要有办法判断它是否真的已经准备好了。

这并不意味着在任何人使用智能体之前,就先构建一套完美的评估体系。现实里,这几乎从来不切实际。但这确实意味着,要先有足够的评估,用来捕捉明显失败、比较不同版本,并避免盲目发布变更。

大多数评估流程,都是从一小批具有代表性的任务数据集开始。有些样本来自预期用例,有些则来自人工测试、内部试用、支持工单、历史 trace,或已知边界情况。随着时间推移,生产环境中的 traces 会让这些数据集变得更强,但测试应该在上线前就开始。

数据集与指标

数据集,是团队保存所学经验的方式。没有数据集,同样的失败往往会在提示词变更、模型升级或工具更新之后再次出现。

正确的指标取决于任务本身。

有些情况下,存在明确的标准答案。智能体有没有提取出正确的值。有没有选择正确的标签。有没有更新正确的字段。这类任务可以直接按正确性衡量。

还有些时候,并不存在唯一的标准答案。智能体可能需要撰写回复、总结对话、判断是否该升级处理,或通过多条都合理的路径完成任务。在这些情况下,团队会更多依赖基于标准的评估。问题会变成,回复是否有依据,智能体是否遵守策略,是否提出了澄清问题,或者是否在没有多余工具调用的情况下高效完成了任务。

实验

实验,是把数据集和指标真正连接到迭代上的环节。它让团队可以在同一套评估集上,比较提示词、模型、检索策略、工具 schema 和编排模式。随着时间推移,这些实验会显示出智能体是在进步,还是在退化。

目标不是在第一天就做出一套完美的评估体系。目标是先从一套有用的体系开始,然后持续改进它。最有价值的评估数据集,往往来自最难的样本,先来自开发阶段和内部试用,之后再来自生产环境。

模拟

模拟也是测试中的另一个重要部分。

许多智能体都是多轮系统。它们不只是回答一个问题,而是要展开对话、收集信息、调用工具、更新状态,并从模糊情况中恢复。对于这些智能体来说,单轮评估远远不够。团队需要多轮评估,以及模拟的端到端交互。

语音智能体是一个显而易见的例子,但这种模式远不止于此。任何跨多个轮次运行的智能体,都可能需要模拟。一个客服智能体可能需要应对情绪激动的客户,提出追问,检查订单状态,并判断是否有必要升级处理。一个编程智能体可能需要检查代码仓库、做出修改、运行测试,并根据反馈作出响应。一个内部运营智能体可能需要在执行操作前先收集缺失信息。

良好的测试实践,能帮助团队以系统化方式改进智能体,而不是依赖感觉。它把预期行为变成数据集,把数据集变成实验,再把实验变成更好的系统版本。部署之后,监控又会提供真实世界里的样本,让这些评估变得更强。

Deploy

Once an agent has been built and evaluated, it needs an environment where it can reliably run.

For simple agents, deployment may look similar to deploying a traditional application. But many agents need more than a stateless server. They run over longer periods of time, call tools, wait for human input, write files, recover from interruptions, and maintain state across multiple interactions or tasks..

That is why the runtime matters.

A production agent runtime typically needs to support durable execution and human-in-the-loop patterns. Durable execution means the agent can checkpoint progress and resume instead of losing work when something fails. Human-in-the-loop means the agent can pause when it needs approval, clarification, or review.

There are off-the-shelf solutions for this. LangSmith Deployment provides infrastructure for deploying and managing Deep Agents and LangGraph agents. AWS AgentCore is another example of a managed runtime for agents. Some teams also build their own runtime on top of systems like Temporal, especially when they already use Temporal for long-running workflows elsewhere in the stack.

Sandboxes

Many agents also need dedicated execution environments.

Agents increasingly need to write code, execute code, inspect files, transform documents, or interact with a filesystem. In those cases, teams need to decide where that work happens. Sandboxes are a common solution. They provide isolated execution environments with filesystem access, while reducing the blast radius of mistakes or unsafe behavior.

Examples include LangSmith Sandboxes, Daytona, and E2B.

Not every agent requires a full sandbox. In some cases, the agent just needs a place to store and retrieve files. A virtual filesystem can be enough. Deep Agents supports this pattern by allowing agents to use files as working memory without necessarily executing arbitrary code inside a sandbox. Underneath, that filesystem might be backed by systems like Postgres or S3.

Context Hub

Another often overlooked part of deployment is managing prompts and context.

Some of the most important parts of an agent are not traditional application code. Prompts, retrieval context, skills, and task instructions may need to change more often than the application itself. They may also need to be edited by people who are not engineers.

That creates the need for a prompt or context hub: a place to store, version, review, and update the non-code parts of the agent. This allows teams to adjust agent behavior without a full deploy, and it lets domain experts own the context they understand best.

In practice, deployment is not just about putting an agent on a server. It is about giving the agent the runtime, execution environment, and context management systems it needs to do real work.

部署

一旦智能体被构建并完成评估,它就需要一个能够稳定运行的环境。

对于简单智能体,部署可能看起来和传统应用部署差不多。但许多智能体需要的不只是一个无状态服务器。它们会持续运行较长时间、调用工具、等待人工输入、写入文件、从中断中恢复,并在多次交互或任务之间维持状态。

这就是为什么运行时很重要。

一个生产级智能体运行时,通常需要支持持久执行和 human-in-the-loop 模式。持久执行意味着智能体可以检查点保存进度,并在出现故障时恢复,而不是丢失工作。human-in-the-loop 意味着智能体在需要审批、澄清或审查时可以暂停。

这方面已经有现成方案。LangSmith Deployment 提供了部署和管理 Deep Agents 与 LangGraph agents 的基础设施。AWS AgentCore 是另一个托管式智能体运行时的例子。还有一些团队会在 Temporal 这样的系统上自己构建运行时,尤其是当它们本来就已经在技术栈的其他地方用 Temporal 处理长时工作流时。

沙箱

很多智能体还需要专用的执行环境。

智能体越来越常需要编写代码、执行代码、检查文件、转换文档,或与文件系统交互。在这些情况下,团队需要决定这些工作在哪里发生。沙箱是一种常见方案。它们提供带有文件系统访问能力的隔离执行环境,同时降低错误或不安全行为带来的影响范围。

例子包括 LangSmith Sandboxes、Daytona 和 E2B。

并不是每个智能体都需要完整沙箱。在某些情况下,智能体只需要一个可以存取文件的地方。一个虚拟文件系统就够了。Deep Agents 支持这种模式,让智能体把文件当成工作记忆来使用,而不一定要在沙箱里执行任意代码。在底层,这个文件系统可能由 Postgres 或 S3 之类的系统支撑。

上下文中心

部署中另一个常被忽视的部分,是提示词和上下文的管理。

智能体最重要的一些部分,并不是传统意义上的应用代码。提示词、检索上下文、技能和任务指令,可能比应用本身更频繁地变化。它们也可能需要由不是工程师的人来编辑。

这就带来了提示词中心或上下文中心的需求,也就是一个用来存储、版本化、审查和更新智能体非代码部分的地方。这样一来,团队无需完整部署就能调整智能体行为,也能让领域专家负责他们最了解的上下文。

在实际中,部署不只是把智能体放到服务器上。它更是在为智能体配齐它完成真实工作所需要的运行时、执行环境和上下文管理系统。

Monitor

Once agents are deployed, teams need visibility into how they actually behave in production.

This is where monitoring agents differs from monitoring traditional software. Metrics like latency, cost, error rates, and uptime still matter, but they are only part of the picture. An agent can return a technically successful response and still fail the task itself. It may call the wrong tool, rely on the wrong context, skip a required approval step, or produce an answer that sounds plausible but is wrong.

To understand those failures, teams need traces.

A trace captures the full trajectory of the agent: the inputs it received, the model calls it made, the tools it invoked, the outputs it received, and the final response or action it produced. This is the level of detail you need to understand what the agent actually did.

This is why we have argued that agent observability powers agent evaluation, and why the agent improvement loop starts with a trace. If you cannot see the trajectory, you cannot reliably debug the behavior or turn those failures into future evals.

Signals

Monitoring should also include harvesting signals from those traces.

Some of those signals can come from LLM-as-judge evaluators. For example, a judge can score whether the agent answered the user’s question, followed policy, used the right tone, or completed the task. Other signals can be simpler. A regex can catch whether a required phrase appeared, whether a forbidden tool was called, or whether a known failure pattern occurred.

These signals are useful for more than just quality checks. They can also become a form of product analytics. They can tell you which tasks users are asking agents to do, where agents are getting stuck, how often users correct them, and where users perceive errors.

Feedback

Feedback is another core part of monitoring.

It is not enough to store traces alone. Teams also need to store feedback with those traces. That feedback can come from LLM judges, regex-based signals, human reviewers, or direct user feedback collected through an API. In LangSmith, for example, teams can attach user feedback directly to the underlying run, making it easier to connect “the user was unhappy” to “the agent used the wrong tool three steps earlier.”

Dashboards

Finally, teams need dashboards and alerts that can surface trends over time.

A useful agent dashboard tracks metrics like usage, feedback, latency, cost, tool calls, evaluator scores, and recurring failure patterns. Alerts should trigger when important thresholds are crossed, such as rising latency, increasing costs, failing tools, declining user feedback, or spikes in policy violations.

Good monitoring is not just about knowing whether the system is up. It is about understanding whether the agent is doing the right work, in the right way, and improving over time.

The strongest monitoring systems feed directly back into evaluation. Important traces become dataset examples, recurring failures become metrics, and production behavior becomes the foundation for the next round of improvement.

监控

一旦智能体部署完成,团队就需要看清它在生产环境里到底是怎么表现的。

这正是监控智能体和监控传统软件不同的地方。延迟、成本、错误率和可用性这些指标仍然重要,但它们只是一部分。智能体即便返回了一个技术上成功的响应,仍然可能在任务本身上失败。它可能调用了错误的工具,依赖了错误的上下文,跳过了必要的审批步骤,或者给出一个听起来合理但其实错误的答案。

要理解这些失败,团队就需要 traces。

一条 trace 记录了智能体的完整轨迹,包括它接收到的输入、发起的模型调用、调用的工具、收到的输出,以及最终产出的响应或动作。这才是理解智能体实际做了什么所需要的细节层级。

这也是为什么我们一直认为,智能体可观测性是智能体评估的动力来源,也为什么智能体改进循环是从一条 trace 开始的。如果你看不到这条轨迹,就无法可靠地调试它的行为,也无法把这些失败转化成未来的评估样本。

信号

监控还应该包括从这些 traces 中提取信号。

其中一些信号可以来自 LLM-as-judge evaluators。比如,一个 judge 可以给出评分,判断智能体是否回答了用户问题、是否遵守策略、是否使用了合适语气、是否完成了任务。另一些信号则可以更简单。一个 regex 就可以捕捉是否出现了必需短语,是否调用了被禁止的工具,或者是否出现了某种已知失败模式。

这些信号的用途不只是质量检查。它们还可以成为一种产品分析形式。它们可以告诉你,用户在让智能体做哪些任务,智能体卡在了哪里,用户多久会纠正它一次,以及用户会在哪些地方感知到错误。

反馈

反馈也是监控中的另一个核心部分。

光保存 traces 还不够。团队还需要把反馈和这些 traces 一起保存。反馈可以来自 LLM judges、基于 regex 的信号、人工审查员,或通过 API 收集的直接用户反馈。比如在 LangSmith 中,团队可以把用户反馈直接附加到底层 run 上,这样就更容易把用户不满意,和智能体在前三步用了错误工具这样的事实关联起来。

仪表盘

最后,团队还需要仪表盘和告警,用来展示随时间变化的趋势。

一个有用的智能体仪表盘,会跟踪使用量、反馈、延迟、成本、工具调用、评估器得分,以及反复出现的失败模式等指标。告警则应该在重要阈值被突破时触发,比如延迟上升、成本增加、工具失效、用户反馈下降,或策略违规激增。

好的监控,不只是知道系统是否在线。它更是在理解智能体是否在做正确的事,是否以正确的方式在做,并且是否在持续变好。

最强的监控系统,会直接把结果反馈回评估。重要的 traces 会变成数据集样本,反复出现的失败会变成指标,而生产环境中的行为则会成为下一轮改进的基础。

Iterate

The best organizations move through the agent development lifecycle quickly and systematically.

They do not wait for a perfect agent before shipping. Instead, they build something useful, test it enough to understand its behavior, deploy it in a controlled way, monitor how it performs in production, and feed those learnings back into the next version.

That does not mean shipping carelessly. The key is having visibility.

Teams with datasets, experiments, tracing, feedback, and dashboards can learn directly from real real usage. They can test changes before rolling them out broadly, identify what broke in production, turn failures into evals, and improve the agent without relying on guesswork.

This is how teams hill-climb, and how agent systems improve over time.

The most effective teams find the hard examples, understand why the agent failed, and adjust the prompt, tool configuration, retrieval strategy, model, middleware, or workflow. They re-run the evals, deploy the better version, and monitoring gives them the next edge cases and failures.

Inside an enterprise, the challenge is making that loop repeatable across teams.

If every team has to build its own evaluation framework, deployment infrastructure, tracing system, feedback pipeline, and dashboards from scratch, agent development will move slowly. The most effective organizations invest in shared infrastructure so teams can move through the lifecycle without constantly reinventing the underlying systems.

That is what makes the agent development lifecycle an operational practice.

迭代

最优秀的组织,会快速而系统地跑完智能体开发生命周期。

它们不会等到智能体完美才发布。相反,它们会先构建出有用的东西,做足够的测试来理解它的行为,以可控方式部署它,监控它在生产环境中的表现,再把这些经验反馈到下一个版本。

这并不意味着草率发布。关键在于可见性。

拥有数据集、实验、tracing、反馈和仪表盘的团队,可以直接从真实使用中学习。它们可以在大范围推广之前测试变更,找出生产环境里哪里坏了,把失败转成评估,再在不靠猜测的情况下改进智能体。

这就是团队如何一步步爬坡,也是在这个过程中,智能体系统如何持续进步。

最有效的团队,会找出那些困难样本,理解智能体为什么失败,然后调整提示词、工具配置、检索策略、模型、middleware 或工作流。它们会重新运行 evals,部署更好的版本,而监控会继续给它们带来下一批边界情况和失败样本。

在企业内部,挑战在于如何让这个循环在不同团队之间也能重复运行。

如果每个团队都得从零构建自己的评估框架、部署基础设施、tracing 系统、反馈管道和仪表盘,智能体开发就会推进得很慢。最有效的组织会投资共享基础设施,让团队能够穿过整个生命周期,而不用反复重造底层系统。

这正是智能体开发生命周期之所以成为一种运营实践的原因。

Govern

Governance sits around the entire agent development lifecycle.

For a single agent, lightweight controls may be enough. As organizations deploy more agents, governance becomes necessary. Without it, teams quickly end up with agents that are difficult to discover, difficult to monitor, expensive to run, and unclear in what they are allowed to do.

https://www.langchain.com/langgraph

Cost

The first governance challenge is cost.

Agents can become expensive because they may involve multiple model calls, long context windows, repeated tools usage, retries, or run for a long time. Organizations need ways to track and manage that spend through budgets, usage monitoring, alerts, and visibility into which agents, teams, models, or tools are driving costs.

Tool Access

The second governance challenge is tool access.

Agents are useful because they can take action, but that also introduces risk. Teams need clear controls around which tools an agent can access, under what conditions, and on behalf of which users.

This is where audit trails become important. If an agent calls a tool, organizations should be able to inspect which agent made the call, what inputs it used, what outputs it produced, and what user or policy authorized the action. Tool calls are often where agent behavior drives business impact, so they need to be observable and reviewable.

Human-in-the-loop is another important governance mechanism.

Not every tool call should be fully automated. Some operations should pause for human review, especially when they involve customers, financial systems, sensitive data, or production infrastructure. Human-in-the-loop workflows work best when they are designed into the system from the beginning.

Discoverability

The third governance challenge is discoverability and reuse.

As organizations build more agents, they also accumulate more reusable assets such as prompts, skills, tools, retrieval sources, policies, and even other agents. Without good discovery and governance mechanisms, teams tend to recreate these components repeatedly, leading to inconsistency.Shared context and shared agents need to be findable, reusable, and governed.

This is especially important for skills. A skill can encode a workflow, a writing style, a domain-specific procedure, or instructions for using a tool. If one team has already built a good skill, another team should be able to find it rather than write a new version from scratch.

Good governance is not about slowing teams down. It is about making fast iteration possible without losing visibility, control, or consistency as agent systems scale.

治理

治理覆盖整个智能体开发生命周期。

对于单个智能体,轻量控制可能就够了。随着组织部署越来越多的智能体,治理就变得必要。没有治理,团队很快就会面对一些难以发现、难以监控、运行昂贵,而且权限边界不清的智能体。

https://www.langchain.com/langgraph

成本

第一个治理挑战是成本。

智能体可能会变得昂贵,因为它可能涉及多次模型调用、超长上下文窗口、反复使用工具、重试,或者运行很长时间。组织需要通过预算、使用监控、告警,以及对究竟是哪些智能体、团队、模型或工具在推高成本的可见性,来追踪和管理这部分支出。

工具访问

第二个治理挑战是工具访问。

智能体之所以有用,是因为它们能采取行动,但这也引入了风险。团队需要围绕智能体可以访问哪些工具、在什么条件下访问、以及代表哪些用户访问,建立清晰的控制机制。

这正是审计轨迹变得重要的地方。如果一个智能体调用了工具,组织应该能够检查到底是哪个智能体发起了调用,它使用了什么输入,产出了什么输出,以及是哪位用户或哪条策略授权了这次操作。工具调用往往是智能体行为真正影响业务的地方,因此它们必须可观测、可审查。

human-in-the-loop 也是另一种重要的治理机制。

并不是每一次工具调用都应该完全自动化。有些操作应该暂停,等待人工审查,尤其是当它们涉及客户、财务系统、敏感数据或生产基础设施时。human-in-the-loop 工作流,只有从系统一开始设计时就被纳入,效果才最好。

可发现性

第三个治理挑战是可发现性与复用。

随着组织构建越来越多的智能体,也会积累越来越多可复用资产,比如提示词、技能、工具、检索源、策略,甚至其他智能体。没有良好的发现和治理机制,团队往往会反复重建这些组件,最终导致不一致。共享上下文和共享智能体,必须能被找到、被复用,并且被治理。

这一点对技能尤其重要。一个技能可以编码一个工作流、一种写作风格、一个领域专用流程,或者一套使用工具的指令。如果某个团队已经做出了一个好技能,另一个团队就应该能够找到它,而不是从零重写一个新版本。

好的治理,不是为了拖慢团队速度。它是为了让快速迭代在系统规模扩大时,依然不会失去可见性、控制力和一致性。

Conclusion

The best organizations have already started to operate this way. They ship early, but they do not ship blindly. They evaluate before deploying, monitor behavior after deployment, and continuously use what they learn to make the next version better.

That is what makes agent development repeatable. It is also what allows agents to move from demos into reliable production systems.

结论

最优秀的组织已经开始这样运作了。它们会尽早发布,但不会盲目发布。它们会在部署前评估,在部署后监控行为,并持续利用所学经验让下一个版本变得更好。

这正是智能体开发之所以能够被重复执行的原因。也正因如此,智能体才能从演示走向可靠的生产系统。

链接 http://x.com/i/article/2053156837453959169

Everyone wants to ship agents.

The best organizations have figured out how to do it repeatedly, safely, and systematically. They ship early, learn from real usage, and iterate quickly. They don’t treat agents as one-off demos or isolated projects.

Instead, they’ve built an agent development lifecycle that creates momentum by turning experimentation into a repeatable system for shipping, learning, and improving over time.

That lifecycle has four parts:

Build → Test → Deploy → Monitor

https://www.langchain.com/blog/traces-start-agent-improvement-loop

The order is intentional. Testing should start before an agent reaches production, not after. Teams need to test the agents before deployment, deploy them in a controlled way, monitor how they behave in production, and feed those learnings back into the next build and evaluation cycle.

For a single agent, this process can stay lightweight. Across many agents, it becomes an infrastructure and governance challenge. Teams need shared ways to control cost, manage tool access, inspect tool calls, reuse context, and decide where humans need to be involved.

The difference between getting an agent to work once to building agents as a repeatable practice comes from having the right development lifecycle in place.

Build

The build phase is where teams decide what kind of agent system they are creating and what level of abstraction they want to use.

There is a wide range of tooling here. Some tools are code-first, while others are no-code or low-code. Some focus on abstractions, while others focus on giving agents a working environment with prompts, tools, skills, and state.

https://docs.langchain.com/langgraph-platform

On the code-first side, teams often reach for open-source frameworks and harnesses. In the LangChain ecosystem, that includes LangChain, LangGraph, and Deep Agents. Outside of LangChain, examples include CrewAI and Claude Agents SDK.

These tools operate at different layers of the stack.

Agent frameworks focus primarily on abstractions. They help developers compose model calls, tools, prompts, retrieval, structured outputs, and agent loops. LangChain and CrewAI are examples in this category.

Agent runtimes focus on execution. They support agents that need state, control flow, durability, and human intervention. LangGraph is the clearest example in the LangChain ecosystem. It gives you a way to build agentic systems that can branch, loop, pause, resume, and persist state over time.

Agent harnesses focus on doing. They provide the surrounding structure agents need for longer-running tasks: prompts, skills, MCP servers, hooks, middleware, and sometimes a filesystem. Deep Agents and the Claude Agent SDK are examples of this pattern.

These distinctions matter because “building an agent” can mean different things.

For a simple application, it may only involve defining a tool-calling loop. For a more sophisticated agent, it may involve writing prompts, defining skills, connecting MCP servers, configuring middleware, and setting up context the agent can retrieve or update over time.

No-code building

There is also a no-code and low-code side of the build phase. Tools like LangSmith Fleet, Claude Cowork, and n8n allow more people to participate in agent development. That matters because the person who understands the workflow needed is not always the person who writes the code.

At the same time, no-code tools do not eliminate the need for engineering control. As systems become more complex, teams usually need ways to extend or override behavior in code. Hooks and middleware are especially important here because they allow teams to add custom logic around tool calls, context handling, approvals, auth, or business rules without rebuilding every agent from scratch.

The best build environments make simple things simple and complex things possible. They let domain experts edit prompts, skills, and context, while still giving engineers control over the parts that need to be reliable, testable, and governed.

Test

Before an agent is deployed, teams need a way to determine whether it is actually ready.

That does not mean building a perfect eval suite before anyone uses the agent. In practice, that is rarely realistic. It does mean having enough evals in place to catch obvious failures, compare versions, and avoid shipping changes blindly.

Most eval workflows start with a small dataset of representative tasks. Some examples come from expected use cases, while others come from manual testing, dogfooding, support tickets, prior traces, or known edge cases. Over time, production traces make these datasets much stronger, but testing should start before production.

Datasets and metrics

Datasets are how teams preserve what they learn. Without them, the same failures tend to reappear after prompt changes, model upgrades, or tool updates.

The right metrics depend on the task.

In some cases, there is a clear ground truth answer. Did the agent extract the right value? Did it choose the right label? Did it update the right field? These tasks can be measured directly for correctness.

Other times, there is no single ground truth answer. An agent may need to write a response, summarize a conversation, decide whether to escalate, or complete a task with many valid paths. In those cases, teams rely more on criteria-based evaluation. The questions become whether the response was grounded, whether the agent followed policy, whether it asked for clarification, or whether it completed the task efficiently without unnecessary tool calls.

Experiments

Experiments are what connect datasets and metrics to iteration. They allow teams to compare prompts, models,retrieval strategies, tool schemas, and orchestration patterns against the same evaluation set. . Over time, these experiments show whether the agent is improving or regressing.

The goal is not to create a perfect eval suite on day one. The goal is to start with a useful one and continuously improve it. The most valuable eval datasets are built from the hardest examples: first from development and dogfooding, then later from production.

Simulations

Simulation is another important part of testing.

Many agents are multi-turn systems. They do not just answer one question; they have a conversation, gather information, call tools, update state, and recover from ambiguity. For those agents, single-turn evals are not enough. Teams need multi-turn evals and simulated end-to-end interactions.

Voice agents are an obvious example, but the pattern is broader. Any agent that operates over a sequence of turns may need simulation. A support agent may need to handle a frustrated customer, ask follow-up questions, check order status, and decide whether escalation is necessary. A coding agent may need to inspect a repository, make changes, run tests, and respond to feedback. An internal operations agent may need to gather missing information before taking action.

Good testing practices help teams improve agents systematically without relying on vibes. They turn expected behavior into datasets, datasets into experiments, and experiments into better versions of the system. After deployment, monitoring supplies the real-world examples that make those evals stronger.

Deploy

Once an agent has been built and evaluated, it needs an environment where it can reliably run.

For simple agents, deployment may look similar to deploying a traditional application. But many agents need more than a stateless server. They run over longer periods of time, call tools, wait for human input, write files, recover from interruptions, and maintain state across multiple interactions or tasks..

That is why the runtime matters.

A production agent runtime typically needs to support durable execution and human-in-the-loop patterns. Durable execution means the agent can checkpoint progress and resume instead of losing work when something fails. Human-in-the-loop means the agent can pause when it needs approval, clarification, or review.

There are off-the-shelf solutions for this. LangSmith Deployment provides infrastructure for deploying and managing Deep Agents and LangGraph agents. AWS AgentCore is another example of a managed runtime for agents. Some teams also build their own runtime on top of systems like Temporal, especially when they already use Temporal for long-running workflows elsewhere in the stack.

Sandboxes

Many agents also need dedicated execution environments.

Agents increasingly need to write code, execute code, inspect files, transform documents, or interact with a filesystem. In those cases, teams need to decide where that work happens. Sandboxes are a common solution. They provide isolated execution environments with filesystem access, while reducing the blast radius of mistakes or unsafe behavior.

Examples include LangSmith Sandboxes, Daytona, and E2B.

Not every agent requires a full sandbox. In some cases, the agent just needs a place to store and retrieve files. A virtual filesystem can be enough. Deep Agents supports this pattern by allowing agents to use files as working memory without necessarily executing arbitrary code inside a sandbox. Underneath, that filesystem might be backed by systems like Postgres or S3.

Context Hub

Another often overlooked part of deployment is managing prompts and context.

Some of the most important parts of an agent are not traditional application code. Prompts, retrieval context, skills, and task instructions may need to change more often than the application itself. They may also need to be edited by people who are not engineers.

That creates the need for a prompt or context hub: a place to store, version, review, and update the non-code parts of the agent. This allows teams to adjust agent behavior without a full deploy, and it lets domain experts own the context they understand best.

In practice, deployment is not just about putting an agent on a server. It is about giving the agent the runtime, execution environment, and context management systems it needs to do real work.

Monitor

Once agents are deployed, teams need visibility into how they actually behave in production.

This is where monitoring agents differs from monitoring traditional software. Metrics like latency, cost, error rates, and uptime still matter, but they are only part of the picture. An agent can return a technically successful response and still fail the task itself. It may call the wrong tool, rely on the wrong context, skip a required approval step, or produce an answer that sounds plausible but is wrong.

To understand those failures, teams need traces.

A trace captures the full trajectory of the agent: the inputs it received, the model calls it made, the tools it invoked, the outputs it received, and the final response or action it produced. This is the level of detail you need to understand what the agent actually did.

This is why we have argued that agent observability powers agent evaluation, and why the agent improvement loop starts with a trace. If you cannot see the trajectory, you cannot reliably debug the behavior or turn those failures into future evals.

Signals

Monitoring should also include harvesting signals from those traces.

Some of those signals can come from LLM-as-judge evaluators. For example, a judge can score whether the agent answered the user’s question, followed policy, used the right tone, or completed the task. Other signals can be simpler. A regex can catch whether a required phrase appeared, whether a forbidden tool was called, or whether a known failure pattern occurred.

These signals are useful for more than just quality checks. They can also become a form of product analytics. They can tell you which tasks users are asking agents to do, where agents are getting stuck, how often users correct them, and where users perceive errors.

Feedback

Feedback is another core part of monitoring.

It is not enough to store traces alone. Teams also need to store feedback with those traces. That feedback can come from LLM judges, regex-based signals, human reviewers, or direct user feedback collected through an API. In LangSmith, for example, teams can attach user feedback directly to the underlying run, making it easier to connect “the user was unhappy” to “the agent used the wrong tool three steps earlier.”

Dashboards

Finally, teams need dashboards and alerts that can surface trends over time.

A useful agent dashboard tracks metrics like usage, feedback, latency, cost, tool calls, evaluator scores, and recurring failure patterns. Alerts should trigger when important thresholds are crossed, such as rising latency, increasing costs, failing tools, declining user feedback, or spikes in policy violations.

Good monitoring is not just about knowing whether the system is up. It is about understanding whether the agent is doing the right work, in the right way, and improving over time.

The strongest monitoring systems feed directly back into evaluation. Important traces become dataset examples, recurring failures become metrics, and production behavior becomes the foundation for the next round of improvement.

Iterate

The best organizations move through the agent development lifecycle quickly and systematically.

They do not wait for a perfect agent before shipping. Instead, they build something useful, test it enough to understand its behavior, deploy it in a controlled way, monitor how it performs in production, and feed those learnings back into the next version.

That does not mean shipping carelessly. The key is having visibility.

Teams with datasets, experiments, tracing, feedback, and dashboards can learn directly from real real usage. They can test changes before rolling them out broadly, identify what broke in production, turn failures into evals, and improve the agent without relying on guesswork.

This is how teams hill-climb, and how agent systems improve over time.

The most effective teams find the hard examples, understand why the agent failed, and adjust the prompt, tool configuration, retrieval strategy, model, middleware, or workflow. They re-run the evals, deploy the better version, and monitoring gives them the next edge cases and failures.

Inside an enterprise, the challenge is making that loop repeatable across teams.

If every team has to build its own evaluation framework, deployment infrastructure, tracing system, feedback pipeline, and dashboards from scratch, agent development will move slowly. The most effective organizations invest in shared infrastructure so teams can move through the lifecycle without constantly reinventing the underlying systems.

That is what makes the agent development lifecycle an operational practice.

Govern

Governance sits around the entire agent development lifecycle.

For a single agent, lightweight controls may be enough. As organizations deploy more agents, governance becomes necessary. Without it, teams quickly end up with agents that are difficult to discover, difficult to monitor, expensive to run, and unclear in what they are allowed to do.

https://www.langchain.com/langgraph

Cost

The first governance challenge is cost.

Agents can become expensive because they may involve multiple model calls, long context windows, repeated tools usage, retries, or run for a long time. Organizations need ways to track and manage that spend through budgets, usage monitoring, alerts, and visibility into which agents, teams, models, or tools are driving costs.

Tool Access

The second governance challenge is tool access.

Agents are useful because they can take action, but that also introduces risk. Teams need clear controls around which tools an agent can access, under what conditions, and on behalf of which users.

This is where audit trails become important. If an agent calls a tool, organizations should be able to inspect which agent made the call, what inputs it used, what outputs it produced, and what user or policy authorized the action. Tool calls are often where agent behavior drives business impact, so they need to be observable and reviewable.

Human-in-the-loop is another important governance mechanism.

Not every tool call should be fully automated. Some operations should pause for human review, especially when they involve customers, financial systems, sensitive data, or production infrastructure. Human-in-the-loop workflows work best when they are designed into the system from the beginning.

Discoverability

The third governance challenge is discoverability and reuse.

As organizations build more agents, they also accumulate more reusable assets such as prompts, skills, tools, retrieval sources, policies, and even other agents. Without good discovery and governance mechanisms, teams tend to recreate these components repeatedly, leading to inconsistency.Shared context and shared agents need to be findable, reusable, and governed.

This is especially important for skills. A skill can encode a workflow, a writing style, a domain-specific procedure, or instructions for using a tool. If one team has already built a good skill, another team should be able to find it rather than write a new version from scratch.

Good governance is not about slowing teams down. It is about making fast iteration possible without losing visibility, control, or consistency as agent systems scale.

Conclusion

The best organizations have already started to operate this way. They ship early, but they do not ship blindly. They evaluate before deploying, monitor behavior after deployment, and continuously use what they learn to make the next version better.

That is what makes agent development repeatable. It is also what allows agents to move from demos into reliable production systems.

📋 讨论归档

讨论进行中…