🧠 阿头学 · 💬 讨论题

生产环境“自愈”流水线:自动发现回归、分诊归因并生成修复 PR

这套方案真正有价值的不是“AI 会修 bug”,而是把部署后的检测、归因和修复串成闭环;但把“自动提 PR”包装成“生产自愈”明显夸大了能力。

2026-04-03 原文链接 ↗
阅读简报
双语对照
完整翻译
原文
讨论归档

核心观点

  • 真正有价值的不是写代码,而是发布后闭环 文章最站得住脚的判断是:有了 coding agent 之后,瓶颈已经不是“把代码写出来”,而是“上线后怎么自动发现是否搞坏了系统、怎么快速定位、怎么发起修复”;这个重心转移是对的,也比单纯吹编码能力更接近真实工程难点。
  • 架构上最合理的是“统计筛查 + 分诊约束 + 修复执行”三段式 作者没有把错误直接丢给 Open SWE,而是先用基线和泊松检验过滤噪声,再让分诊智能体建立“diff 某行改动—错误”因果链,最后才让编码智能体修;这种职责分离是成熟设计,因为它承认了 agent 最容易犯的错不是不会写,而是在高噪声上下文里乱写。
  • “自愈”这个词用过头了 系统当前终点只是自动开 PR,而不是自动回滚、自动热修、自动恢复 SLA;既然故障版本仍在线,且修复仍要人工 review 和重新发布,那它最多算“自动补救流水线”,不算严格意义上的生产自愈,这个概念偷换必须点出来。
  • 统计检测方法能用,但前提并不稳 用 7 天基线 + 60 分钟窗口 + 泊松检验判断错误签名是否异常上升,这在工程上是低成本且可落地的;但生产错误常常受流量峰值、第三方 API、批量失败影响,独立性假设经常不成立,所以 p<0.05 不能被当成硬证据,它只是一个粗筛器,不是归因证明。
  • 最大的风险不在修复,而在归因幻觉 分诊智能体被要求从最新 diff 中找到明确因果链,这个约束方向没错,但 LLM 很容易为了完成任务而脑补因果;再加上当前只看最近一次 diff、错误归一化依赖正则和截断,这会让系统对延迟爆发型 bug、跨版本问题和复杂日志模式都存在系统性漏判或误判。

跟我们的关联

  • 对 ATou 意味着什么、下一步怎么用 这说明 AI coding 的护城河不在“会写”,而在“出事后能否闭环”;下一步如果做 agent 产品或内部工具,应该优先补 observability、回归检测、分诊和 rollback 策略,而不是继续堆 demo 式生成能力。
  • 对 Neta 意味着什么、下一步怎么用 这套“基线对比 + 异常筛查 + 结构化归因”不只适用于代码发布,也适用于增长波动、转化异常、内容质量回退;下一步可以把同样框架迁移到运营指标监控,而不是继续靠拍脑袋复盘。
  • 对 Uota 意味着什么、下一步怎么用 文章暴露了一个关键产品判断:用户要的不是更能写的 agent,而是更可信的 agent;下一步在设计 agent workflow 时,应优先做“看门人层”,例如先做任务分诊、置信度输出、证据链审计,再让执行 agent 动手。
  • 对三者共同意味着什么、下一步怎么用 这篇文章最值得拿走的不是 LangChain 工具链,而是“四段式闭环:Deploy → Monitor → Triage → Fix”;下一步可以各自盘点业务中哪些流程已经有执行 agent,但缺少异常发现和分诊层,先补闭环再谈全自动。

讨论引子

1. 只会自动提 PR、不会自动恢复服务的系统,配不配叫“自愈”,还是这是典型的技术营销包装?
2. 在生产环境里,错误归因到底该更多交给统计模型,还是更多交给 LLM 分诊,哪一边的误判代价更大?
3. 如果必须在“自动回滚”和“向前修复”之间二选一,什么条件下应该让 agent 有权直接回滚?

我给我们的 GTM Agent 搭了一个会自愈的部署流水线。每次部署后,它会检测回归,判断这些问题是不是这次改动引起的,然后启动另一个智能体去开一个包含修复的 PR。

有了写代码的智能体,发布的难点不在把代码推上去,而在后面的一切:上一版部署有没有把东西搞坏,问题到底因何而起,能不能在用户察觉前修好。我希望的是部署完就去做别的事,并且相信只要出现回归,系统会自己发现并把闭环跑完。

自愈流程如何运转

GTM Agent 是用 Deep Agents 搭的,通过 LangSmith Deployments 部署。团队里本来就有一个内部编码智能体 Open SWE,它是一个开源的异步编码智能体,可以调研代码库、写修复并开 PR。缺的那块是自动化的回归检测和分诊,把生产环境的错误自动关联到 Open SWE。

https://x.com/RampLabs/status/2036165188899012655

每次部署到生产后,会立刻触发一个自愈的 GitHub Action,把构建日志和服务端日志都抓下来。流程分两条路:(1) 立刻捕捉构建失败,(2) 在一个时间窗口内监控服务端回归。任意一路确认是实打实的问题,就启动 Open SWE 去修复并开 PR。

捕捉 Docker 构建失败

第一步看构建日志,确认 Docker 镜像能正常构建。要是镜像构建失败,流水线会自动从 CLI 把报错日志导出来,再把最近一次提交到 main 的 git diff 拉出来,一并交给 Open SWE,全程不需要人介入。构建失败几乎总是最近的改动造成的,所以给一个范围很窄的 diff,就足够 Open SWE 直接动手。

监控部署后的错误

服务端问题比构建失败难处理。生产系统本身就有一条底噪式的错误率,网络超时、第三方 API 抖动、偶发失败等等。理想情况下每个都该跟踪并修掉,但要回答上一版部署有没有弄坏东西,就得把这次改动带来的错误,从本来就存在的噪声里分离出来。这一步干的就是这个。

先收集过去 7 天所有错误日志作为基线。把它们归一化成错误签名,用正则把 UUID、时间戳和很长的数字串替换掉,再截断到 200 个字符。这样即使细节不同,逻辑上同一种错误也会被归到同一个桶里。
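
归一化这一步可以最小化地示意如下(“正则替换 + 截断到 200 字符”依原文描述,具体正则模式是笔者的假设):

```python
import re

def normalize_error(message: str, max_len: int = 200) -> str:
    """把原始错误信息归一化成错误签名(示意实现,非原文代码)。"""
    sig = message
    # 替换 UUID(8-4-4-4-12 的十六进制段)
    sig = re.sub(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}",
        "<UUID>", sig)
    # 替换 ISO 风格时间戳
    sig = re.sub(
        r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?",
        "<TS>", sig)
    # 替换很长的数字串(例如自增 ID)
    sig = re.sub(r"\d{5,}", "<NUM>", sig)
    # 截断到固定长度,避免超长堆栈把同类错误分进不同桶
    return sig[:max_len]
```

这样,两条只在 ID、时间上不同的错误会落进同一个签名桶里。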

接着在部署后的 60 分钟窗口里轮询当前 revision 的错误,归一化方式和上面一致。窗口结束时,就有了两个完全不同时间尺度上的计数:一周的基线数据,以及部署后一小时的数据。虽然可以粗暴地直接比这两个数字来判断最新改动是否引入了错误,但我更想用一个更有章法的方法,也顺便复习一下概率分布🙃。

用泊松检验做门槛

泊松分布用来建模在一个固定时间区间内,某个事件发生的次数。在已知平均发生率(λ)并假设事件相互独立的前提下,它可以描述计数的分布:

https://x.com/LangChain/status/2031055593360990358
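
即,若固定窗口内的期望发生次数为 λ,观测到 k 次的概率为:

```latex
P(X = k) \;=\; \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots
```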

生产环境的基线错误与泊松模型的契合度还不错。用这 7 天基线数据,为每个错误签名估计每小时的期望错误率,然后按比例换算到部署后的 60 分钟窗口。如果观测到的计数显著高于分布预测(p < 0.05),就把它标记为可能的回归。对那些完全新的错误签名(基线里完全没出现过),只要在监控窗口里重复出现,就同样标记。
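
这个门槛只用标准库就能实现;下面是一个最小示意(“按小时折算 λ、p < 0.05 判为疑似回归”依原文描述,函数名与折算细节为笔者假设):

```python
import math

def poisson_p_value(observed: int, lam: float) -> float:
    """单侧检验:P(X >= observed),X ~ Poisson(lam)。
    等价于 1 - CDF(observed - 1),纯标准库实现。"""
    if observed <= 0:
        return 1.0
    # 累加 P(X = 0 .. observed-1) 后取补
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k)
              for k in range(observed))
    return max(0.0, 1.0 - cdf)

def is_regression(baseline_count_7d: int, observed_1h: int,
                  alpha: float = 0.05) -> bool:
    """用 7 天基线折算 60 分钟窗口的期望 λ,再做单侧泊松检验。"""
    lam = baseline_count_7d / (7 * 24)  # 每小时期望次数,窗口恰为 1 小时
    return poisson_p_value(observed_1h, lam) < alpha
```

例如某签名一周出现 168 次(λ = 1 次/小时),部署后一小时冒出 6 次就会被标记,而 2 次不会。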

https://www.langchain.com/langsmith/deployment

不过服务端错误并不总是独立的。流量尖峰或 API 故障造成的相关性失败,会违背独立性假设;单靠统计检验也分不清错误飙升究竟是我们改动造成的,还是第三方 API 挂了。这时就轮到分诊智能体上场。

分诊智能体

没有把错误直接喂给 Open SWE,因为它很容易一上来就想改代码。这里又加了一道门槛,把最近一次提交的 diff 和具体错误一起交给一个分诊智能体(基于 Deep Agents)。

分诊智能体先把所有变更文件分类为 runtime、prompt/config、test、docs、CI 等。若改动只碰了非 runtime 文件,这次部署导致该错误的概率极低。这样能避免误报,防止智能体从一个测试文件胡乱脑补出一条通往生产故障的因果链。
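
文件分类这道闸门大致相当于下面这样的规则匹配(类别取自原文,具体路径模式是笔者的假设):

```python
import re

# 按路径模式把变更文件分类;顺序即优先级,规则需按仓库约定调整
FILE_CLASS_RULES = [
    (r"(^|/)tests?(/|$)|_test\.py$|\.test\.ts$", "test"),
    (r"(^|/)docs?(/|$)|\.md$", "docs"),
    (r"(^|/)\.github/workflows/", "ci"),
    (r"\.(ya?ml|toml|json|env)$|(^|/)prompts?(/|$)", "prompt/config"),
]

def classify_file(path: str) -> str:
    for pattern, label in FILE_CLASS_RULES:
        if re.search(pattern, path):
            return label
    return "runtime"  # 兜底:其余都当作运行时代码

def touches_runtime(changed_files: list[str]) -> bool:
    """只有 diff 碰到 runtime 文件时,才值得继续追因果链。"""
    return any(classify_file(p) == "runtime" for p in changed_files)
```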

如果确实改了 runtime,智能体必须在 diff 的某一行改动和观测到的错误之间建立明确的因果关联。

它会返回一个结构化结论,包括决策、置信度、理由,以及它认为由这次改动引起的错误签名。这样收敛之后,Open SWE 拿到的是一个聚焦的排查任务,而不是把所有飙升的错误一股脑倒过去。
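
这个结构化结论可以想象成类似下面的对象(字段名与放行阈值均为笔者假设,并非原文 schema):

```python
from dataclasses import dataclass, field

@dataclass
class TriageVerdict:
    """分诊智能体输出的结构化结论(示意)。"""
    decision: str       # "investigate" 或 "dismiss"
    confidence: float   # 0.0 ~ 1.0 的归因置信度
    reasoning: str      # 因果链说明:diff 哪一行改动 → 哪个错误
    error_signatures: list[str] = field(default_factory=list)

    def should_dispatch(self, threshold: float = 0.7) -> bool:
        """确认归因且置信度达标时,才把任务交给编码智能体。"""
        return self.decision == "investigate" and self.confidence >= threshold
```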

用 Open SWE 完成闭环

分诊智能体一旦放行调查,Open SWE 就接管,排查并修复问题,然后开一个 PR。等它准备好评审时我会收到通知,于是从错误检测到给出修复方案的整个流程都不需要手工介入。

目前最有价值的是抓到那些不会大声崩溃的 bug:悄无声息地返回错误默认值的失败,代码与部署之间的配置不匹配,以及那种连锁回归,修掉一个 bug 后在下一次部署里又暴露出另一个。

未来改进

更长的回溯窗口

分诊智能体现在只看当前版本和上一版本之间的差异。更早版本埋下、后来才浮现的 bug,不会被自动归因。把回溯窗口拉长是个显而易见的改法,但喂给分诊智能体的 diff 越多,信号就越嘈杂,越难钉住因果链。现在还没找到合适的平衡点。

更聪明的错误聚类

现在的做法是把错误信息里的 ID、时间戳清洗掉,用模糊匹配来分组。为了让它靠谱花了不少时间,但受限于清洗逻辑,相关错误没被归到一起的情况估计还存在。

一个在考虑的想法是把错误信息做 embedding,放到向量空间里做聚类,而不是依赖正则归一化。语义相同的错误会自然靠得很近,不管表面字符串差异多大。这样就能通过监控部署后是否出现新簇、或现有簇是否扩张来发现回归。难点在于距离阈值怎么调,才能区分有意义的簇变化和正常波动。
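
这个思路的核心是“与已有簇心够近就归入、否则开新簇”,可以用一个贪心聚类来示意(embedding 的来源与 0.85 的相似度阈值均为笔者假设):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """两个向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(vectors: list[list[float]],
                   threshold: float = 0.85) -> list[int]:
    """贪心聚类:与某个簇心相似度达到阈值就归入该簇,否则开新簇。
    返回每个向量对应的簇编号。(示意实现,簇心取首个成员。)"""
    centroids: list[list[float]] = []
    labels: list[int] = []
    for v in vectors:
        best, best_sim = -1, threshold
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            centroids.append(v)
            labels.append(len(centroids) - 1)
        else:
            labels.append(best)
    return labels
```

部署后若出现新簇编号、或某个簇的计数异常增长,就是回归信号;难点正是这里的阈值怎么定。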

另一个选择是用更小的模型(大概率是开源的)来分类并分组错误,然后把这些结构化的簇直接作为排查提示的一部分交给 Open SWE,让它对哪里在坏、完整错误长什么样有更立体的认识。

这些方案都是在错误发生之后再改进分组。Ramp 用了一个挺有意思的反向思路,在错误发生之前先定义要观察什么。为了让他们的 Sheets 产品能自维护,每次 PR 合并时都会让一个 LLM 读 diff,基于变更代码生成对应的监控项,每个监控项都带着明确阈值,例如错误率飙升、延迟回归等。监控触发后,webhook 会把告警上下文直接送到一个智能体做分诊。提前定义有针对性的监控,会给出更清晰的信号,让下游智能体更容易诊断问题。

向前修复还是回滚

现在系统总是选择向前修复,Open SWE 在开 PR 的同时,出问题的部署仍然在线。更聪明的做法是根据严重程度、错误率和分诊置信度在两者之间做决策。严重的错误飙升但因果链置信度很低,可能该立刻回滚;而归因明确、修复路径清晰的 bug,更适合直接推进补丁。
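
这种决策逻辑可以抽象成一个很小的策略函数(严重度分级与两个阈值均为笔者假设,仅用于说明决策结构):

```python
def choose_action(severity: str, confidence: float) -> str:
    """根据严重程度与归因置信度,在回滚、向前修复与人工介入之间选择。"""
    if severity == "high" and confidence < 0.5:
        return "rollback"     # 错误严重但因果链不明:先回滚止血
    if confidence >= 0.8:
        return "fix-forward"  # 归因明确:让编码智能体直接开修复 PR
    return "escalate"         # 其余情况升级给人判断
```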

把闭环当默认

模式很简单:部署、监控、分诊、修复,自动循环。我最初是为一个智能体部署做的,但它可以推广到任何需要部署代码的服务。每次部署都有同一个问题:总会有东西坏掉,需要有人发现,需要有人修。这个闭环自动化得越多,工程时间就越能从救火转到建设。系统也会更有韧性,因为从出问题到修好的反馈回路越来越接近于零。

链接 http://x.com/i/article/2039590995676053504

I built a self-healing deployment pipeline for our GTM Agent. After every deploy, it detects regressions, triages whether the change caused them, and kicks off an agent to open a PR with a fix.

我给我们的 GTM Agent 搭了一个会自愈的部署流水线。每次部署后,它会检测回归,判断这些问题是不是这次改动引起的,然后启动另一个智能体去开一个包含修复的 PR。

With coding agents, the hard part of shipping isn't getting code out. It's everything after: figuring out if your last deploy broke something, investigating what caused the issue, and fixing it before users notice. I wanted to deploy, move on, and trust that if something regressed, the system would catch it and close the loop itself.

有了写代码的智能体,发布的难点不在把代码推上去,而在后面的一切:上一版部署有没有把东西搞坏,问题到底因何而起,能不能在用户察觉前修好。我希望的是部署完就去做别的事,并且相信只要出现回归,系统会自己发现并把闭环跑完。

How the Self-Healing Flow Works

自愈流程如何运转

The GTM Agent is built on Deep Agents and deploys through LangSmith Deployments. We already had an internal coding agent called Open SWE, an open-source async coding agent that can research a codebase, write fixes, and open PRs. The missing piece was automated regression detection and triage to connect production errors back to Open SWE.

GTM Agent 是用 Deep Agents 搭的,通过 LangSmith Deployments 部署。团队里本来就有一个内部编码智能体 Open SWE,它是一个开源的异步编码智能体,可以调研代码库、写修复并开 PR。缺的那块是自动化的回归检测和分诊,把生产环境的错误自动关联到 Open SWE。

Right after a deployment to production, a self-healing GitHub Action triggers, capturing the build and server logs. The flow has two paths: (1) catching build failures immediately and (2) detecting server-side regressions over a window. If either path finds a real issue, Open SWE gets kicked off to fix it and open a PR.

每次部署到生产后,会立刻触发一个自愈的 GitHub Action,把构建日志和服务端日志都抓下来。流程分两条路:(1) 立刻捕捉构建失败,(2) 在一个时间窗口内监控服务端回归。任意一路确认是实打实的问题,就启动 Open SWE 去修复并开 PR。

Catching Docker Build Failures

捕捉 Docker 构建失败

First, I check the build logs to make sure the Docker images build properly. If the image fails to build, the pipeline automatically pipes the error logs from the CLI, fetches the git diff from the last commit to main, and hands it off to Open SWE, no human involved. Build failures are almost always caused by the most recent change, so a narrow diff gives Open SWE enough context to act on.

第一步看构建日志,确认 Docker 镜像能正常构建。要是镜像构建失败,流水线会自动从 CLI 把报错日志导出来,再把最近一次提交到 main 的 git diff 拉出来,一并交给 Open SWE,全程不需要人介入。构建失败几乎总是最近的改动造成的,所以给一个范围很窄的 diff,就足够 Open SWE 直接动手。

Monitoring for Post-Deploy Errors

监控部署后的错误

Server-side issues are trickier than build failures. A production system carries a background error rate—network timeouts, third-party API issues, transient failures, etc. In an ideal world you'd track and fix every single one, but when trying to answer "did my last deploy break something," you need to separate the errors your change caused from the noise that was already there. That's what this step does.

服务端问题比构建失败难处理。生产系统本身就有一条底噪式的错误率,网络超时、第三方 API 抖动、偶发失败等等。理想情况下每个都该跟踪并修掉,但要回答上一版部署有没有弄坏东西,就得把这次改动带来的错误,从本来就存在的噪声里分离出来。这一步干的就是这个。

First, I collect a baseline of all error logs from the past 7 days. These get normalized into error signatures: a regex replaces UUIDs, timestamps, and long numeric strings, then truncates to 200 characters, so logically identical errors get bucketed together even when the specifics differ.

先收集过去 7 天所有错误日志作为基线。把它们归一化成错误签名,用正则把 UUID、时间戳和很长的数字串替换掉,再截断到 200 个字符。这样即使细节不同,逻辑上同一种错误也会被归到同一个桶里。

Next, I poll for errors from the current revision over a 60-minute window after deployment, normalizing the same way. Once that window closes, I have error counts from two very different time scales: a week of baseline data and an hour of post-deployment data. While I could naively compare these two numbers to detect if our latest change caused an error, I wanted to take a more principled approach (and brush up on my probability distributions🙃).

接着在部署后的 60 分钟窗口里轮询当前 revision 的错误,归一化方式和上面一致。窗口结束时,就有了两个完全不同时间尺度上的计数:一周的基线数据,以及部署后一小时的数据。虽然可以粗暴地直接比这两个数字来判断最新改动是否引入了错误,但我更想用一个更有章法的方法,也顺便复习一下概率分布🙃。

Gating with a Poisson Test

用泊松检验做门槛

A Poisson distribution models how many times an event occurs in a fixed interval, given a known average rate (λ) and the assumption that events are independent:

泊松分布用来建模在一个固定时间区间内,某个事件发生的次数。在已知平均发生率(λ)并假设事件相互独立的前提下,它可以描述计数的分布:

Baseline production errors fit a Poisson model reasonably well. Using the 7-day baseline, I estimate the expected error rate per hour for each error signature, then scale it to the 60-minute post-deployment window. If the observed count significantly exceeds what the distribution predicts (p < 0.05), I flag it as a potential regression. For error signatures that are completely new (not present in the baseline at all), I flag them if they occur repeatedly in the monitoring window.

生产环境的基线错误与泊松模型的契合度还不错。用这 7 天基线数据,为每个错误签名估计每小时的期望错误率,然后按比例换算到部署后的 60 分钟窗口。如果观测到的计数显著高于分布预测(p < 0.05),就把它标记为可能的回归。对那些完全新的错误签名(基线里完全没出现过),只要在监控窗口里重复出现,就同样标记。

But server errors aren't always independent. Correlated failures from traffic spikes or API outages can violate the independence assumption, and a statistical test alone can't distinguish "this error spiked because of our code change" from "this error spiked because a third-party API went down." That's where the triage agent comes in.

不过服务端错误并不总是独立的。流量尖峰或 API 故障造成的相关性失败,会违背独立性假设;单靠统计检验也分不清错误飙升究竟是我们改动造成的,还是第三方 API 挂了。这时就轮到分诊智能体上场。

The Triage Agent

分诊智能体

Rather than feeding errors directly into Open SWE (which is tempted to make changes), I add another gating mechanism. The diffs from the last commit and the specific error get passed into a triage agent (built on Deep Agents).

没有把错误直接喂给 Open SWE,因为它很容易一上来就想改代码。这里又加了一道门槛,把最近一次提交的 diff 和具体错误一起交给一个分诊智能体(基于 Deep Agents)。

First, the triage agent classifies every changed file as runtime, prompt/config, test, docs, CI, etc. If a change only touches non-runtime files, it's extremely unlikely the deployment caused the error. This prevents false positives where the agent might hallucinate a causal chain from a test file to a production bug.

分诊智能体先把所有变更文件分类为 runtime、prompt/config、test、docs、CI 等。若改动只碰了非 runtime 文件,这次部署导致该错误的概率极低。这样能避免误报,防止智能体从一个测试文件胡乱脑补出一条通往生产故障的因果链。

For runtime changes, the agent must establish a concrete causal link between a specific line in the diff and the observed error.

如果确实改了 runtime,智能体必须在 diff 的某一行改动和观测到的错误之间建立明确的因果关联。

The agent returns a structured verdict with its decision, confidence, reasoning, and the error signatures it attributes to the change. This narrowing means Open SWE receives a focused investigation prompt rather than a dump of every error that spiked.

它会返回一个结构化结论,包括决策、置信度、理由,以及它认为由这次改动引起的错误签名。这样收敛之后,Open SWE 拿到的是一个聚焦的排查任务,而不是把所有飙升的错误一股脑倒过去。

Closing the Loop with Open SWE

用 Open SWE 完成闭环

Once the triage agent green-lights an investigation, Open SWE takes over, works through the bug, and opens a PR. I get notified when it's ready for review, so the entire flow from error detection to proposed fix happens without any manual intervention.

分诊智能体一旦放行调查,Open SWE 就接管,排查并修复问题,然后开一个 PR。等它准备好评审时我会收到通知,于是从错误检测到给出修复方案的整个流程都不需要手工介入。

So far, it's been most useful for catching bugs that don't crash loudly: silent failures that return wrong defaults, configuration mismatches between code and deployment, and cascading regressions where fixing one bug unmasks the next on the subsequent deploy.

目前最有价值的是抓到那些不会大声崩溃的 bug:悄无声息地返回错误默认值的失败,代码与部署之间的配置不匹配,以及那种连锁回归,修掉一个 bug 后在下一次部署里又暴露出另一个。

Future Improvements

未来改进

Wider Lookback Window

更长的回溯窗口

The triage agent currently looks at the difference between the current and previous version. Bugs introduced in earlier versions that only surface later won't get auto-attributed. Widening the lookback window is an obvious fix, but the more diffs you feed into the triage agent, the noisier the signal gets and the harder it is to pinpoint a causal link. I haven't landed on the right balance yet.

分诊智能体现在只看当前版本和上一版本之间的差异。更早版本埋下、后来才浮现的 bug,不会被自动归因。把回溯窗口拉长是个显而易见的改法,但喂给分诊智能体的 diff 越多,信号就越嘈杂,越难钉住因果链。现在还没找到合适的平衡点。

Smarter Error Grouping

更聪明的错误聚类

The current approach uses fuzzy matching by sanitizing IDs and timestamps from error messages. It took some time to get right, and there are probably still cases where related errors don't get grouped together due to limitations in the sanitization logic.

现在的做法是把错误信息里的 ID、时间戳清洗掉,用模糊匹配来分组。为了让它靠谱花了不少时间,但受限于清洗逻辑,相关错误没被归到一起的情况估计还存在。

One idea I've been considering is embedding error messages into a vector space and clustering them, rather than relying on regex normalization. Errors that mean the same thing would naturally land near each other regardless of surface-level differences, and I could detect regressions by monitoring for new clusters forming or existing clusters growing after a deploy. The challenge is tuning distance thresholds for what constitutes a meaningful cluster shift versus normal variance.

一个在考虑的想法是把错误信息做 embedding,放到向量空间里做聚类,而不是依赖正则归一化。语义相同的错误会自然靠得很近,不管表面字符串差异多大。这样就能通过监控部署后是否出现新簇、或现有簇是否扩张来发现回归。难点在于距离阈值怎么调,才能区分有意义的簇变化和正常波动。

Another option is using a smaller model (likely open source) to classify and group errors, then pass those structured clusters directly to Open SWE as part of the investigation prompt, giving it a much richer picture of what's failing and how the full error looks.

另一个选择是用更小的模型(大概率是开源的)来分类并分组错误,然后把这些结构化的簇直接作为排查提示的一部分交给 Open SWE,让它对哪里在坏、完整错误长什么样有更立体的认识。

All of these approaches improve grouping after errors happen. Ramp took an interesting approach that works the other way around, defining what to watch for before errors happen. To make their Sheets product self-maintaining, on every PR merge an LLM reads the diff and generates monitors tailored to the changed code, each with explicit thresholds for error rate spikes, latency regressions, etc. When a monitor fires, a webhook delivers the alert context directly to an agent for triage. Defining a targeted monitor upfront produces a much clearer signal, making it easier for a downstream agent to diagnose the issue.

这些方案都是在错误发生之后再改进分组。Ramp 用了一个挺有意思的反向思路,在错误发生之前先定义要观察什么。为了让他们的 Sheets 产品能自维护,每次 PR 合并时都会让一个 LLM 读 diff,基于变更代码生成对应的监控项,每个监控项都带着明确阈值,例如错误率飙升、延迟回归等。监控触发后,webhook 会把告警上下文直接送到一个智能体做分诊。提前定义有针对性的监控,会给出更清晰的信号,让下游智能体更容易诊断问题。

Fix-Forward vs Looking Back

向前修复还是回滚

Right now the system always fixes forward: Open SWE works on a PR while the broken deployment stays live. A smarter approach would be deciding between the two based on severity, error rate, and triage confidence. A high-severity spike with a low-confidence causal chain might warrant an immediate rollback, while a well-attributed bug with a clear fix path is better handled by pushing a patch forward.

现在系统总是选择向前修复,Open SWE 在开 PR 的同时,出问题的部署仍然在线。更聪明的做法是根据严重程度、错误率和分诊置信度在两者之间做决策。严重的错误飙升但因果链置信度很低,可能该立刻回滚;而归因明确、修复路径清晰的 bug,更适合直接推进补丁。

The Loop as Default

把闭环当默认

The pattern is simple: deploy, monitor, triage, and fix—automatically in a loop. I built this for a single agent deployment, but it generalizes to any service that deploys code. Every deployment has the same problem. Something breaks, someone has to notice, someone has to fix it. The more of that loop you automate, the more engineering time shifts from reacting to building. Systems get more resilient because the feedback loop between breaking and fixing approaches zero.

模式很简单:部署、监控、分诊、修复,自动循环。我最初是为一个智能体部署做的,但它可以推广到任何需要部署代码的服务。每次部署都有同一个问题:总会有东西坏掉,需要有人发现,需要有人修。这个闭环自动化得越多,工程时间就越能从救火转到建设。系统也会更有韧性,因为从出问题到修好的反馈回路越来越接近于零。


📋 讨论归档

讨论进行中…