🧠 阿头学 · 💬 Discussion topic

Anthropic pushes agents from "able to chat" to "deliverables that can be accepted"

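The most valuable takeaway from this document: Anthropic is productizing agents as "workflows with acceptance criteria," but it also leaves the risks of cost, misjudgment, and black-box debugging squarely with developers.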
2026-05-07

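Key points

  • From conversation to work: the document's key innovation is not new API detail but the `outcome + rubric + grader` structure, which turns "having a chat" into "delivering a result." The direction is right, because what enterprises will actually pay for is output that can be accepted, not polished conversation.
  • A separate grader is the architectural highlight: Anthropic explicitly separates the executing agent from the grader and stresses the separate context window. This design is more reliable than having one model grade its own work, but "more reliable" is not "reliable enough": a separate context only reduces self-confirmation bias and cannot remove the ceiling on the model's overall capability.
  • The rubric becomes the new core of prompt engineering: the real bottleneck is not whether the agent can call tools but whether you can write scoring criteria that are verifiable, low in ambiguity, and not overly subjective. With a badly written rubric, the system just pursues the wrong goal more efficiently.
  • The event stream wraps a probabilistic model in a state machine: the `satisfied / needs_revision / failed / interrupted / max_iterations_reached` result set is a very engineering-minded design that clearly improves monitorability and orchestration. For production systems this is real progress, not a relabeling.
  • Black-box grading carries real risk: the documentation states outright that the grader's internal reasoning is not visible, so when the grader misjudges, developers have almost no precise handle for debugging. This is not a minor blemish but a core risk that production use will amplify.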

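How this relates to us

  • What it means for ATou and how to use it next: if ATou wants to build agent products that sell, the focus should move off the "how intelligent" narrative and onto "how completion is defined." The most useful next step is an industry-specific library of rubric templates that productizes delivery standards, rather than competing only on model capability.
  • What it means for Neta and how to use it next: if Neta cares about system design, this page shows that the core of a reliable agent is separating execution from review. The next step is to port the pattern to content moderation, research summaries, sales outreach, and similar workflows, with human spot checks added on top; otherwise grader false positives will do real damage.
  • What it means for Uota and how to use it next: from a user-experience angle, this pattern implies that future agent UIs will look more like task boards and acceptance flows than chat boxes. The next step is to work out how to present "criteria, progress, gaps, deliverables" as a more intuitive product experience, rather than staying fixated on streaming-typing effects.
  • What it means for investment judgment and how to use it next: this shows Managed Agents moving up from a model API toward "workflow infrastructure," which is positive for platform valuation. What to watch next is not release cadence but whether customers can actually use it to cut costs reliably; otherwise it looks more like polished packaging that drives up token consumption.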

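Discussion prompts

1. If the grader and the agent are essentially models of the same tier, does "independent grading" actually raise quality, or does it only create an illusion of quality?
2. In real business settings, should the rubric be written by product managers, by domain experts, or drafted by AI and then revised by people?
3. When the system starts optimizing for the rubric rather than the real goal, how should an enterprise design human fallbacks and circuit breakers?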


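Define outcomes

Tell the agent what "done" looks like, and let it iterate until it gets there.

An outcome elevates a session from _conversation_ to _work_. You define what the end result should look like and how to measure quality. The agent works toward that target, self-evaluating and iterating until the outcome is met.

When you define an outcome, the harness automatically provisions a _grader_ to evaluate the artifact against a rubric. The grader uses a separate context window so that it is not influenced by the main agent's implementation choices.

The grader returns a per-criterion breakdown: either confirmation that the artifact satisfies the rubric, or the specific gaps between the current work and the requirements. That feedback is handed back to the agent for the next iteration.

All Managed Agents API requests require the managed-agents-2026-04-01 beta header. The SDK sets this beta header automatically.

Create a rubric

A rubric is a Markdown document describing per-criterion scoring. The rubric is required.

Example rubric: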

# DCF Model Rubric

## Revenue Projections
- Uses historical revenue data from the last 5 fiscal years
- Projects revenue for at least 5 years forward
- Growth rate assumptions are explicitly stated and reasonable

## Cost Structure
- COGS and operating expenses are modeled separately
- Margins are consistent with historical trends or deviations are justified

## Discount Rate
- WACC is calculated with stated assumptions for cost of equity and cost of debt
- Beta, risk-free rate, and equity risk premium are sourced or justified

## Terminal Value
- Uses either perpetuity growth or exit multiple method (stated which)
- Terminal growth rate does not exceed long-term GDP growth

## Output Quality
- All figures are in a single .xlsx file with clearly labeled sheets
- Key assumptions are on a separate "Assumptions" sheet
- Sensitivity analysis on WACC and terminal growth rate is included
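
Because the rubric is plain Markdown, it can be checked locally before it is sent. The following is a minimal sketch, not part of the API: `parse_rubric`, `flag_vague`, and the `VAGUE_WORDS` list are hypothetical helpers that assume the `## Section` and `- criterion` layout of the example above. They count the criteria and point at wording a grader might find hard to verify.

```python
import re

# Hypothetical word list; tune it for your own domain.
VAGUE_WORDS = ("reasonable", "appropriate", "good", "clear")


def parse_rubric(markdown):
    """Group '- ' criteria under their '## ' section headings."""
    sections = {}
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].strip())
    return sections


def flag_vague(sections):
    """Return (section, criterion) pairs that contain a word from VAGUE_WORDS."""
    flagged = []
    for name, criteria in sections.items():
        for criterion in criteria:
            words = re.findall(r"[a-z]+", criterion.lower())
            if any(word in words for word in VAGUE_WORDS):
                flagged.append((name, criterion))
    return flagged


# A shortened copy of the example rubric, used only to exercise the helpers.
RUBRIC = """# DCF Model Rubric

## Revenue Projections
- Uses historical revenue data from the last 5 fiscal years
- Projects revenue for at least 5 years forward
- Growth rate assumptions are explicitly stated and reasonable

## Output Quality
- All figures are in a single .xlsx file with clearly labeled sheets
"""

sections = parse_rubric(RUBRIC)
total = sum(len(criteria) for criteria in sections.values())
print(f"{len(sections)} sections, {total} criteria")
for name, criterion in flag_vague(sections):
    print(f"review wording: [{name}] {criterion}")
```

A flagged criterion is not necessarily wrong; it is a prompt to ask whether the grader could check it from the artifact alone.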

Pass the rubric as inline text on user.define_outcome (see the next section), or upload it via the Files API for reuse across sessions.

Uploading requires the beta header files-api-2025-04-14.

from pathlib import Path

rubric = client.beta.files.upload(file=Path("/tmp/rubric.md"))
print(f"Uploaded rubric: {rubric.id}")

Create a session with an outcome

After creating a session, send a user.define_outcome event. The agent begins work immediately; no additional user message event is required.

# Create a session
session = client.beta.sessions.create(
    agent=agent.id,
    environment_id=environment.id,
    title="Financial analysis on Costco",
)

# Define the outcome — agent starts working on receipt
client.beta.sessions.events.send(
    session_id=session.id,
    events=[
        {
            "type": "user.define_outcome",
            "description": "Build a DCF model for Costco in .xlsx",
            "rubric": {"type": "text", "content": RUBRIC},
            # or: "rubric": {"type": "file", "file_id": rubric.id},
            "max_iterations": 5,  # optional; default 3, max 20
        }
    ],
)
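
The event payload is small enough that it can be validated before it is sent. The sketch below is one possible approach under stated assumptions: `build_define_outcome` is a hypothetical helper, the default of 3 and the maximum of 20 come from the comment in the snippet above, and the lower bound of 1 is an assumption that the source does not state.

```python
MAX_ALLOWED_ITERATIONS = 20  # upper bound taken from the snippet's comment


def build_define_outcome(description, rubric_text=None, rubric_file_id=None,
                         max_iterations=None):
    """Return a user.define_outcome event dict after basic local checks."""
    # Exactly one rubric source must be given: inline text or an uploaded file.
    if (rubric_text is None) == (rubric_file_id is None):
        raise ValueError("provide exactly one of rubric_text or rubric_file_id")
    if rubric_text is not None:
        rubric = {"type": "text", "content": rubric_text}
    else:
        rubric = {"type": "file", "file_id": rubric_file_id}

    event = {
        "type": "user.define_outcome",
        "description": description,
        "rubric": rubric,
    }
    # Leave max_iterations out to let the service apply its default (3).
    if max_iterations is not None:
        # The lower bound of 1 is an assumption, not a documented limit.
        if not 1 <= max_iterations <= MAX_ALLOWED_ITERATIONS:
            raise ValueError("max_iterations must be between 1 and 20")
        event["max_iterations"] = max_iterations
    return event


event = build_define_outcome(
    "Build a DCF model for Costco in .xlsx",
    rubric_text="# DCF Model Rubric\n...",
    max_iterations=5,
)
print(event["rubric"]["type"], event["max_iterations"])
```

The returned dict can be placed in the `events` list passed to `client.beta.sessions.events.send`, as in the snippet above.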

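Outcome events

Progress on an outcome-oriented session is surfaced on the event stream.

  • agent.* events (messages, tool use, and so on) show progress toward the outcome.
  • span.outcome_evaluation_* events are emitted only for outcome-oriented sessions and show the number of iteration loops and the grader's feedback process.
  • You can also send user.message events to an outcome-oriented session to direct the agent's work as it progresses, but these are less necessary: the agent knows to work until it has exhausted its iterations or achieved the outcome.
  • A user.interrupt event pauses work on the current outcome and marks span.outcome_evaluation_end.result as interrupted, allowing you to start a new outcome.
  • After the final outcome evaluation, the session can continue as a conversational session, or a new outcome can be started. The session retains the history of the prior outcome.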
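
The event families listed above can be told apart by their type prefix. This is a minimal sketch that assumes each event carries a `type` string; the `agent.message` and `agent.tool_use` names in the sample are illustrative placeholders, since the source only describes the `agent.*` family in general terms.

```python
from collections import Counter


def classify(event_type):
    """Map an event type string to the family it belongs to."""
    if event_type.startswith("span.outcome_evaluation_"):
        return "evaluation"
    if event_type.startswith("agent."):
        return "agent"
    if event_type.startswith("user."):
        return "user"
    return "other"


# Sample stream; the two agent.* names are hypothetical.
sample_types = [
    "user.define_outcome",
    "agent.message",
    "agent.tool_use",
    "span.outcome_evaluation_start",
    "span.outcome_evaluation_ongoing",
    "span.outcome_evaluation_end",
]

counts = Counter(classify(t) for t in sample_types)
print(dict(counts))
```

Routing on the family first keeps the handler for grader events separate from the handler for agent output, which mirrors the separation between agent and grader described earlier.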

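Define outcome user event

Only one outcome is supported at a time, but you can chain outcomes in sequence. To do this, send a new user.define_outcome event after the terminal event of the previous outcome.

This is the event you send to initiate an outcome. It is echoed back on receipt, including a processed_at timestamp and an outcome_id.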

{
  "type": "user.define_outcome",
  "description": "Build a DCF model for Costco in .xlsx",
  "rubric": { "type": "file", "file_id": "file_01..." },
  "max_iterations": 5
}
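
Chaining outcomes in sequence can be sketched as a small queue that releases the next event only when the previous outcome has ended. The `OutcomeChain` class is hypothetical, and it assumes that a span.outcome_evaluation_end event with any result other than needs_revision counts as the terminal event; depending on how the service behaves, waiting for the session to become idle may be the safer trigger.

```python
from collections import deque

# Assumption: every result except needs_revision ends the current outcome.
TERMINAL_RESULTS = {"satisfied", "max_iterations_reached", "failed", "interrupted"}


class OutcomeChain:
    """Hands out user.define_outcome events one at a time, in order."""

    def __init__(self, outcome_events):
        self._pending = deque(outcome_events)

    def start(self):
        """Return the first outcome to send, or None if the chain is empty."""
        return self._pending.popleft() if self._pending else None

    def on_event(self, event):
        """Return the next outcome to send, or None if nothing is due yet."""
        if event.get("type") != "span.outcome_evaluation_end":
            return None
        if event.get("result") not in TERMINAL_RESULTS:
            return None
        return self._pending.popleft() if self._pending else None


first = {"type": "user.define_outcome", "description": "Build a DCF model"}
second = {"type": "user.define_outcome", "description": "Write a summary memo"}
chain = OutcomeChain([first, second])

to_send = chain.start()
after_revision = chain.on_event(
    {"type": "span.outcome_evaluation_end", "result": "needs_revision"}
)
after_satisfied = chain.on_event(
    {"type": "span.outcome_evaluation_end", "result": "satisfied"}
)
print(to_send["description"], after_revision, after_satisfied["description"])
```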

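Outcome evaluation start

Emitted once the grader starts an evaluation over one iteration loop. The iteration field is a 0-indexed revision counter: 0 is the first evaluation, 1 is the re-evaluation after the first revision, and so on.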

{
  "type": "span.outcome_evaluation_start",
  "id": "sevt_01def...",
  "outcome_id": "outc_01a...",
  "iteration": 0,
  "processed_at": "2026-03-25T14:01:45Z"
}

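Outcome evaluation ongoing

Heartbeat emitted while the grader runs. The grader's internal reasoning is opaque: you see that it is working, not what it is thinking.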

{
  "type": "span.outcome_evaluation_ongoing",
  "id": "sevt_01ghi...",
  "outcome_id": "outc_01a...",
  "processed_at": "2026-03-25T14:02:10Z"
}
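
Since the heartbeat is the only signal available while the grader runs, a client may want a simple stall check based on `processed_at`. The sketch below is an illustration only: `is_stalled` is a hypothetical helper, and the 120-second timeout is a placeholder, because the source does not say how often heartbeats are emitted.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse a timestamp such as 2026-03-25T14:02:10Z into an aware datetime."""
    # Replacing the trailing Z keeps this working on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stalled(last_event, now, timeout=timedelta(seconds=120)):
    """Return True if the last seen event is older than the timeout."""
    return now - parse_ts(last_event["processed_at"]) > timeout


heartbeat = {
    "type": "span.outcome_evaluation_ongoing",
    "processed_at": "2026-03-25T14:02:10Z",
}

soon_after = datetime(2026, 3, 25, 14, 3, 0, tzinfo=timezone.utc)
much_later = datetime(2026, 3, 25, 14, 5, 0, tzinfo=timezone.utc)
print(is_stalled(heartbeat, soon_after), is_stalled(heartbeat, much_later))
```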

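Outcome evaluation end

Emitted after the grader finishes evaluating one iteration. The result field indicates what happens next.

| Result | Next |
| --- | --- |
| satisfied | Session transitions to idle. |
| needs_revision | Agent starts a new iteration cycle. |
| max_iterations_reached | No further evaluation cycles. The agent may run one final revision before the session transitions to idle. |
| failed | Session transitions to idle. Returned when the rubric fundamentally does not match the task, for example if the description and rubric contradict each other. |
| interrupted | Only emitted if outcome_evaluation_start already fired before the interrupt. |
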
{
  "type": "span.outcome_evaluation_end",
  "id": "sevt_01jkl...",
  "outcome_evaluation_start_id": "sevt_01def...",
  "outcome_id": "outc_01a...",
  "result": "satisfied",
  "explanation": "All 12 criteria met: revenue projections use 5 years of historical data, WACC assumptions are stated, sensitivity table is included...",
  "iteration": 0,
  "usage": {
    "input_tokens": 2400,
    "output_tokens": 350,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 1800
  },
  "processed_at": "2026-03-25T14:03:00Z"
}
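
Each evaluation end event carries a `usage` block, so token counts can be added up across iterations to see what the evaluation loop consumed. This is a minimal sketch assuming the field names shown in the example event; `total_usage` is a hypothetical helper, and the first event in the sample uses made-up numbers.

```python
USAGE_KEYS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def total_usage(end_events):
    """Sum the usage counters over span.outcome_evaluation_end events."""
    totals = dict.fromkeys(USAGE_KEYS, 0)
    for event in end_events:
        usage = event.get("usage", {})
        for key in USAGE_KEYS:
            totals[key] += usage.get(key, 0)
    return totals


end_events = [
    {   # illustrative numbers for a first pass that needed revision
        "result": "needs_revision",
        "iteration": 0,
        "usage": {"input_tokens": 2000, "output_tokens": 300,
                  "cache_creation_input_tokens": 0, "cache_read_input_tokens": 0},
    },
    {   # usage values copied from the example event above
        "result": "satisfied",
        "iteration": 1,
        "usage": {"input_tokens": 2400, "output_tokens": 350,
                  "cache_creation_input_tokens": 0, "cache_read_input_tokens": 1800},
    },
]

totals = total_usage(end_events)
print(totals)
```

Tracking these totals per outcome makes it easier to judge whether a higher max_iterations setting is worth its cost.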

Checking on outcome status

You can either listen on the event stream for span.outcome_evaluation_end, or poll GET /v1/sessions/:id and read outcome_evaluations[].result:

session = client.beta.sessions.retrieve(session.id)

for outcome in session.outcome_evaluations:
    print(f"{outcome.outcome_id}: {outcome.result}")
    # outc_01a...: satisfied
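
The polling option can be wrapped in a bounded loop. The sketch below is written against an injected `fetch_results` callable so that it runs without the SDK; `wait_for_outcome`, the poll interval, and the poll limit are hypothetical, and it assumes that the list is empty or ends in needs_revision while work is still in progress.

```python
import time


def wait_for_outcome(fetch_results, poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Poll until the latest evaluation result is something other than needs_revision."""
    for _ in range(max_polls):
        results = fetch_results()  # list of result strings, latest last
        if results and results[-1] != "needs_revision":
            return results[-1]
        sleep(poll_seconds)
    raise TimeoutError("outcome did not finish within the polling budget")


# Stand-in for repeated session retrievals: each call returns the next snapshot.
snapshots = iter([[], ["needs_revision"], ["needs_revision", "satisfied"]])
final = wait_for_outcome(lambda: next(snapshots), sleep=lambda seconds: None)
print(final)
```

With the SDK, `fetch_results` could return `[o.result for o in client.beta.sessions.retrieve(session.id).outcome_evaluations]`, following the snippet above.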

Retrieving deliverables

The agent writes output files to /mnt/session/outputs/ inside the container. Once the session is idle, fetch them via the Files API scoped to the session:

# List files produced by this session
files = client.beta.files.list(scope_id=session.id)
for f in files:
    print(f.id, f.filename)

# Download a file
if files.data:
    content = client.beta.files.download(files.data[0].id)
    content.write_to_file("/tmp/output.txt")
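
When a session produces several files, it can help to decide which ones to download and where to put them before calling the download endpoint. The following sketch assumes only the `id` and `filename` attributes used in the listing above; `plan_downloads`, the file IDs, and the destination directory are hypothetical.

```python
from pathlib import Path, PurePosixPath
from types import SimpleNamespace


def plan_downloads(files, dest_dir, suffixes=(".xlsx",)):
    """Return (file_id, local_path) pairs for files with a wanted suffix."""
    dest = Path(dest_dir)
    plan = []
    for f in files:
        # Keep only the final path component so a file cannot escape dest_dir.
        name = PurePosixPath(f.filename).name
        if name and name.lower().endswith(suffixes):
            plan.append((f.id, dest / name))
    return plan


# Stand-ins for the objects returned by the file listing.
listed = [
    SimpleNamespace(id="file_a", filename="costco_dcf.xlsx"),
    SimpleNamespace(id="file_b", filename="notes.txt"),
    SimpleNamespace(id="file_c", filename="../archive/Old_Model.XLSX"),
]

plan = plan_downloads(listed, "/tmp/outputs")
for file_id, local_path in plan:
    print(file_id, local_path.name)
```

Each pair in the plan can then be passed to `client.beta.files.download(file_id)` and written to `local_path`, as in the snippet above.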


