
OpenClaw-RL: A Conversation-Driven Asynchronous Reinforcement Learning Framework

OpenClaw-RL turns everyday conversations into training signals and claims to let a model keep improving itself silently in the background. But the "zero-labeling" pitch masks a transfer of feedback cost, not its elimination.

2026-03-16 · Original link ↗

Key Takeaways

  • The asynchronous four-loop architecture is genuine differentiation: inference serving, data collection, reward evaluation, and policy training are fully decoupled into independent asynchronous processes, so the front end never stalls while the model trains in the background. This addresses the "stop-the-world training" pain point of traditional RLHF and is pragmatic engineering.
  • The Binary RL + OPD combination has theoretical grounding: a scalar reward (good/bad) decides whether a sample is worth learning from, while a token-level directional signal (how exactly to change) steers the optimization. This hybrid supervision carries more information than either method alone, but rigorous quantitative evidence for the comparison is missing.
  • "Zero manual labeling" is marketing copy, not a technical breakthrough: the pitch stresses "no manual labeling", yet the usage guide explicitly asks users to "provide frequent feedback (👍/👎)" or "provide concrete textual feedback". In essence, offline labeling is converted into online interactive labeling; the labeling cost has not disappeared, it has been repackaged as "normal usage".
  • The evaluation evidence is highly subjective and likely overfit: the flagship case is "visible improvement after 36/24 interactions", with no control group, no quantitative metrics, and no statistical significance. Fine-tuning on a sample this small most likely produces severe overfitting to one user's preferences rather than genuine generalization gains.
  • There is tension between the privacy promise and cloud deployment: the project stresses "self-hosted, privacy-first, data stays in your system", yet to lower the compute barrier (8× GPUs by default) it recommends the Tinker cloud platform, with no account of where the data flows or what privacy guarantees apply.

What This Means for Us

  • For Neta: this is an engineering blueprint for an "online learning loop". If you build AI products, the core moat is shifting from "how strong the initial model is" to "how quickly you can mine high-quality gradients from user behavior". Next step to consider: is your product's feedback mechanism already structured as a "scalar + directional signal" hybrid, or do you only collect scattered ratings and free-text complaints?
  • For ATou: if you manage a team or work on growth, the "Binary RL + OPD" feedback model transfers directly: performance evaluation needs both a clear success/failure verdict (Binary) and high-value after-the-fact guidance (OPD). The key is to systematically retain these hints rather than let them evaporate after a single retrospective. Next step to try: design a two-tier protocol of "low-friction Binary feedback + high-value OPD feedback".
  • For Uota: this framework is suggestive for "localized AI agents". The question is not "my prompts are tuned better" but "can local users' behavior automatically train a localized policy". If you build overseas products, consider exposing a feedback protocol (thumbs up/down plus a short reason) so the system learns preference distributions from local user behavior. The long-term moat is that feedback data, not the initial model.
  • In general: this is one concrete implementation of the "conversation-driven learning" paradigm, but its generality is overstated. The real limit: if your environment's feedback is inherently sparse or noisy (long-horizon tasks, safety-critical settings), whether this system works stably is entirely unknown. The next question to press on: under what conditions does this method fail, and how do we define an objective metric for "significant improvement"?

Discussion Prompts

  • With only 24-36 fine-tuning interactions, how do we distinguish genuine generalization gains from overfitting to a single user's preferences? Is there a way to test the model on new users and new tasks, rather than relying on "visibly better" subjective impressions?
  • How is the contradiction between "self-hosted + privacy-first" and "we recommend the Tinker cloud platform" resolved? If the data ultimately has to go to the cloud to solve the compute problem, what binding force does the privacy promise actually have?
  • In asynchronous RL, how is the drift between the data-collecting behavior policy and the target policy being trained (off-policy drift) kept under control? This critical algorithmic-stability issue is not mentioned anywhere in the documentation.
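For readers unfamiliar with the mechanics behind that last question: PPO-family methods typically bound off-policy drift by clipping the probability ratio between the policy being trained and the stale policy that generated the data. A minimal illustrative sketch, not taken from the OpenClaw-RL codebase:

```python
import math

def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate objective for a single action/token.

    `logp_old` comes from the behavior policy that generated the sample;
    `logp_new` from the target policy currently being trained. Clipping
    the ratio to [1 - eps, 1 + eps] caps how much a drifted sample can
    contribute to the update.
    """
    ratio = math.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# A sample that has drifted far off-policy (ratio ~ 6.7) is capped at 1.2:
print(clipped_surrogate(logp_new=-0.1, logp_old=-2.0, advantage=1.0))  # 1.2
```

Whether OpenClaw-RL relies on clipping alone, or also bounds the staleness of queued rollouts, is exactly the open question raised above.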

GitHub - Gen-Verse/OpenClaw-RL: OpenClaw-RL: Train any agent simply by talking · GitHub


OpenClaw-RL

Empowering OpenClaw with RL — Train a personalized agent simply by talking to it.

Scalable RL in real-world settings — Agentic RL for terminal, GUI, SWE, and tool-call settings.

 demo.mp4

📰 News

  • [2026/3/13] 🚀 OpenClaw-RL now supports both local GPU and cloud (Tinker) deployment. Launch with one line of code — Hybrid RL, OPD, and Binary RL all supported!

  • [2026/3/12] 🔥 We support LoRA training now!

  • [2026/3/10] 🔥 We have released our Technical Report! 🏆 Ranked #1 on HuggingFace Daily Papers!

  • [2026/3/10] 🔥 Huge updates today! We released a new combination method, along with an interesting evaluation of these OpenClaw-RL methods. Track 2 is released too, featuring scalable RL implementations for general agent settings across terminal, GUI, SWE, and tool-call scenarios. We only focus on real-world settings!

  • [2026/3/3] 🙌 Working with the authors of SDFT and SDPO, we have integrated their methods into openclaw-opd. We welcome the integration of novel and effective methods!

  • [2026/3/3] 📺 Check out these community tutorial videos on OpenClaw-RL: Video 1 | Video 2

  • [2026/2/26] 🔥 We release OpenClaw-RL v1 — a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.

💡 TL;DR

OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and it also supports training general agents with large-scale environment parallelization.

Most RL-for-LLM systems assume centralized, batch-mode training on pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background — all without interrupting your usage.

Highlights: Fully async 4-component loop · Self-hosted, privacy-first · Zero manual labeling · Three learning paradigms (Binary RL / OPD / Combine) · Personal + general agent support

🌈 Features

Fully Asynchronous 4-Component Architecture

OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model keeps serving requests while training runs in the background, and judging happens concurrently with new interactions.

Self-Hosted, Privacy-First by Design

The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.

From Feedback to Gradient — Automatically

You do not need to manually label data. The system automatically:

  • Organizes multi-turn interactions into session-aware training trajectories

  • Classifies API messages into main-line (trainable) vs. side (non-trainable) turns

  • Uses the next user, environment, or tool feedback as a natural "next-state" signal

  • Runs PRM/judge evaluation asynchronously, with majority voting for more robust scoring when needed

  • Submits samples to the trainer as soon as they are ready
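As a toy illustration of the classification step above, one might pair each assistant turn with the message that follows it as the "next-state" signal. The role names and the pairing rule here are assumptions for illustration only, not the repo's actual logic:

```python
def classify_turns(messages):
    """Pair each trainable assistant turn with its next-state feedback.

    Toy rule: an assistant turn counts as "main-line" (trainable) when a
    user or tool message follows it; that following message serves as the
    next-state signal. This rule is hypothetical.
    """
    pairs = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        if nxt is not None and nxt["role"] in ("user", "tool"):
            pairs.append((msg, nxt))
    return pairs

conversation = [
    {"role": "user", "content": "fix the failing test"},
    {"role": "assistant", "content": "patched the fixture"},
    {"role": "user", "content": "👍"},   # implicit next-state feedback
]
trainable = classify_turns(conversation)   # one (turn, feedback) pair
```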

Three Optimization Methods in One Framework

Binary RL (GRPO): A Process Reward Model (PRM) scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.
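To make the GRPO step concrete: advantages are computed group-relative, normalizing each rollout's scalar reward against the other rollouts sampled for the same prompt. A minimal sketch of that estimator (illustrative only, not the repo's implementation):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of one prompt, scored good/good/bad/good by the PRM:
advs = grpo_advantages([1.0, 1.0, 0.0, 1.0])
# The failing rollout gets a strongly negative advantage (~ -1.73),
# the passing ones a modest positive one (~ +0.58).
```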

On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.
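The token-level signal described here can be sketched as a simple per-token log-probability gap. The names and values below are illustrative assumptions, not the repo's API:

```python
def opd_token_advantages(teacher_logprobs, student_logprobs):
    """One directional value per token: positive where the hint-augmented
    teacher assigns the sampled token higher log-probability than the
    student does, negative where it assigns lower."""
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

teacher = [-0.1, -0.3, -2.0]   # log-probs under the hint-augmented prompt
student = [-0.5, -0.3, -0.2]   # log-probs under the original prompt
advs = opd_token_advantages(teacher, student)
# advs ≈ [0.4, 0.0, -1.8]: reinforce token 0, leave token 1, suppress
# token 2 -- far more information than one scalar for the whole sequence.
```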

Combination Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal of OPD. This combination achieves stronger and more robust optimization than either method alone.
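One plausible reading of the combination, inferred from the `--w-rl` and `--w-opd` flags exposed by the Tinker launcher later in this README, is a weighted sum of the two losses. The actual mixing rule inside OpenClaw-RL may differ:

```python
def combined_loss(loss_rl, loss_opd, w_rl=1.0, w_opd=1.0):
    """Hypothetical weighted mix of the Binary-RL surrogate loss and the
    OPD distillation loss; the defaults mirror the documented CLI flags."""
    return w_rl * loss_rl + w_opd * loss_opd

# Equal weights let both signals contribute; zeroing one weight recovers
# the corresponding single method:
total = combined_loss(loss_rl=0.5, loss_opd=0.25)               # 0.75
rl_only = combined_loss(loss_rl=0.5, loss_opd=0.25, w_opd=0.0)  # 0.5
```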

From Personal Agents to Real-World Agentic RL

The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.

🎯 Roadmap

Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:

Track 1 — Personal Agent Optimization (Small-Scale but Personal)

✅ Release Track 1: Fully async OpenClaw-RL framework with Binary RL + OPD
✅ Best recipe discovery via demonstration experiments
✅ Support LoRA training
✅ Deploy training on Tinker
⬜ Support low-precision training/inference
⬜ Beyond the policy: extend learning to skills and memory

Track 2 — General Agent Optimization (Scalable Infra)

✅ Release Track 2: Scalable agentic RL infra for general agents
⬜ Support more cloud services

🤝 Contributing

We welcome contributions that integrate new learning methods into the OpenClaw-RL framework! The integration of SDFT / SDPO into openclaw-opd and the LoRA support are great examples of successful community contributions.

Highly wanted contributions:

  • 🤖 Qwen3.5 model support with slime — launch scripts and model configs for the Qwen3.5 family

  • 🔧 Low-precision training examples — FP8/INT4 training scripts for existing methods

📋 Full contribution guidelines · Feature wishlist

Call for Contributions

We welcome community contributions to OpenClaw-RL! This document outlines our contribution principles and the features we'd love help with.

Contribution Guidelines

OpenClaw-RL is organized as a collection of self-contained method folders (e.g., openclaw-rl/, openclaw-opd/, openclaw-combine/), sitting alongside the shared slime/ training framework and openclaw/ runtime.

Contributions generally fall into two categories:

Adding a new method or deployment target

Create a new top-level folder (parallel to existing ones like openclaw-opd/). All method-specific code — launch scripts, custom loss functions, rollout logic, API server adapters, data processing, and the README — should live inside this folder.

Extending an existing method

For changes within an existing method folder — such as supporting a new model family, adding a LoRA variant, or providing a low-precision example — add new files (e.g., a new .sh script, a new data processing script) rather than modifying existing ones. This keeps the original working examples intact and lets your addition be reviewed independently.

General principles

Do not modify the core framework. Avoid changes to slime/, Megatron-LM/, or openclaw/ unless absolutely necessary. The framework exposes extension points (--custom-loss-function-path, --rollout-function-path, --custom-generate-function-path, --custom-rm-path, etc.) specifically so that new methods can plug in without touching shared code. If a framework change is truly needed, open a separate PR for it with a clear justification.

Include documentation. For a new method folder, add a README.md explaining what the method does, how to run it, key environment variables, and the file structure. For additions to existing folders, add a new section to the existing README.md. See openclaw-combine/README.md or toolcall-rl/README.md for good examples.

Follow existing conventions. Use the same shell script structure (GPU partitioning, CKPT_ARGS, ROLLOUT_ARGS, OPTIMIZER_ARGS, etc.), environment variable naming, and ray job submit launch pattern used by the existing methods.

Highly Preferred Features

1. 🤖 Qwen3.5 Model Support for slime

Type: extend existing method folders

Goal: add launch scripts and model configurations for the Qwen3.5 family across existing methods.

Requirements:

  • Add new .sh scripts for Qwen3.5 in the relevant method folders (e.g., openclaw-combine/run_qwen35_4b_openclaw_combine.sh).

  • Add the corresponding model config in slime/scripts/models/ if Qwen3.5 requires different architecture parameters (hidden size, num layers, etc.) from Qwen3.

  • Verify and document any changes needed for tokenizer, chat template, reasoning parser, or tool-call parser compatibility.

  • Update the READMEs to list Qwen3.5 as a supported model.

2. 🔧 Low-Precision Training/Inference Examples

Type: extend existing method folders

Goal: add low-precision (e.g., INT8/INT4 inference, BF16/FP8 training) example scripts to existing method folders, enabling users to run OpenClaw-RL on consumer-grade hardware with fewer GPUs.

Requirements:

  • Add new .sh scripts within existing method folders — do not modify existing scripts.

  • Low-precision inference: demonstrate launching the SGLang rollout engine with quantized weights (e.g., AWQ/GPTQ INT4) to reduce serving-side VRAM.

  • Low-precision training: if supported by the Megatron backend, demonstrate FP8 or mixed-precision configurations that reduce training memory.

  • Add a section to each affected method folder's README.md documenting these scripts.

If you're interested in any of these, feel free to open an issue to discuss your approach before submitting a PR. We're happy to provide guidance and review!


🔧 Personal Agent Optimization Quick Start

1. Deployment Options

Don't have any money?

  • Hardware: 8× GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)

  • Software: CUDA 12.9, Python 3.12

  • Framework: Slime (our base RL framework)

For detailed environment setup, see Slime or ./instructions/README.md.

Don't have a GPU?

Create a Tinker API. That's all you need. But note that Tinker only supports LoRA, which may not be as effective as full fine-tuning, so we are still testing it.

2. Start the RL Server

We provide three methods (RL servers):

| Dimension | [Binary RL](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-rl) | [OPD](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-opd) | [Combined](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-combine) |
| --- | --- | --- | --- |
| Signal type | Evaluative (good / bad) | Directional | Evaluative + directional |
| Advantage | Sequence-level scalar | Token-level directional | Mixed sequence and token-level |
| Density | All scored turns | Hint-accepted turns only | All scored turns |
| Feedback type | User / environment | Explicit corrections | Both implicit and explicit feedback |
| Signal richness | 1 scalar per sample | 1 value per token | 1 value per token |

Choose your optimization method:

Option A: Combination Method — Recommended!

cd slime
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh

This method combines binary RL and OPD to achieve the best optimization.

See ./openclaw-combine/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_lora.sh

With Tinker (no GPUs at all):

cd openclaw-tinker
python run.py --method combine --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 1 --w-opd 1.0 --w-rl 1.0

See ./openclaw-tinker/README.md for setup details.

Option B: Binary RL — Best for implicit feedback (likes/dislikes, environment success/failure)

cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

The PRM automatically judges response quality from next-state feedback. We recommend providing frequent feedback (e.g., 👍/👎) to help the model optimize effectively.

See ./openclaw-rl/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-rl/run_qwen3_4b_openclaw_rl_lora.sh

With Tinker (no GPUs at all):

cd openclaw-tinker
python run.py --method rl --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 3

See ./openclaw-tinker/README.md for setup details.

Option C: On-Policy Distillation (OPD) — Best for rich textual feedback

cd slime
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh

The system extracts hindsight hints from your feedback and distills them into the policy at the token level. We recommend providing concrete feedback (e.g., "you should have checked the file first" or "don't use that library").

See ./openclaw-opd/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-opd/run_qwen3_4b_openclaw_opd_topk_lora.sh

With Tinker (no GPUs at all):

cd openclaw-tinker
python run.py --method opd --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 1

See ./openclaw-tinker/README.md for setup details.

Once running, the model is served as an OpenAI-compatible API at:

http://HOST_IP:30000/v1

where HOST_IP is the IP address of the machine running the RL server (e.g., 115.190.98.251). Port 30000 is the default and can be changed via the PORT environment variable.

Take note of this endpoint — you will need it when configuring OpenClaw in the next step.
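Because the server speaks the OpenAI chat-completions protocol, any compatible client can target it. A sketch of assembling the endpoint and a request body (HOST_IP is a placeholder; /chat/completions is the standard route such servers expose):

```python
import json
import os

host_ip = "HOST_IP"                          # your RL server machine
port = os.environ.get("PORT", "30000")       # default port documented above
base_url = f"http://{host_ip}:{port}/v1"
endpoint = f"{base_url}/chat/completions"

# Body of a standard OpenAI-compatible chat request against that endpoint:
payload = json.dumps({
    "model": "qwen3-4b",                     # must match your provider config
    "messages": [{"role": "user", "content": "hello"}],
})
```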

We also provide an interesting evaluation case: a student uses OpenClaw to do homework but does not want to be caught using AI, while a teacher uses OpenClaw to grade students' homework and wants the comments to be specific and friendly.

Evaluation setting — both student and teacher use AI!

We find that, under the combined optimization method, OpenClaw needs only 36 problem-solving interactions in the student setting and 24 grading interactions in the teacher setting to achieve a significant and clearly visible improvement.

See ./openclaw-test/README.md for setup and algorithm details.

3. OpenClaw Setup

Install OpenClaw from the version bundled in this repository (we update it regularly):

If you want local file-backed skill authoring in the bundled OpenClaw runtime, see openclaw/extensions/skill-bridge/README.md.

Then configure OpenClaw to route requests to your RL server.

Open your openclaw.json (or the equivalent settings file) and add a provider entry under "models" → "providers":

Example for a Slime-based RL server:

{
  "models": {
    "providers": {
      "qwen": {
        "baseUrl": "http://HOST_IP:30000/v1",
        "apiKey": "apiKey",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b",
            "name": "Qwen3 4B",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

Replace HOST_IP with the IP address of your RL server machine. The apiKey should match the SGLANG_API_KEY you set when starting the server.

Example for a Tinker-based RL server:

{
  "models": {
    "providers": {
      "openclaw-rl": {
        "baseUrl": "http://localhost:30000/v1",
        "apiKey": "no-auth-needed",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b-lora",
            "name": "Qwen3 4B (OpenClaw-RL LoRA)",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

That's it — start chatting with your OpenClaw agent. The RL server automatically collects conversation trajectories, computes rewards, and trains the model. The more you use it, the stronger your agent becomes.

🔧 Agentic RL in Real-World Settings

The same asynchronous RL backbone that powers our personal-agent setting also enables large-scale optimization in these broader real-world environments.

| Setting | Environment | Next-state signal | Horizon |
| --- | --- | --- | --- |
| Terminal | Shell execution sandbox | stdout/stderr, exit code | Long |
| GUI | Screen state + accessibility tree | Visual state diff, task progress | Long |
| SWE | Code repository + test suite | Test verdicts, diff, lint output | Long |
| Tool-call | API/function execution | Return values, error traces | Medium |

🖥️ Terminal Agent — The most commonly used computer-use agent

cd slime
bash ../terminal-rl/terminal_qwen3_8b_rl.sh

See ./terminal-rl/README.md for setup details.

📟 GUI Agent — The most general computer-use agent

cd slime
bash ../gui-rl/gui_qwen3vl_8b_rl.sh

See ./gui-rl/README.md for setup details.

👨‍💻 SWE Agent — The software engineering agent

cd slime
bash ../swe-rl/run_swe_rl_32b_remote_8nodes.sh

See ./swe-rl/README.md for setup details.

🛠️ Tool-Call Agent — The most practical agent

cd slime
bash ../toolcall-rl/retool_qwen3_4b_rl.sh

See ./toolcall-rl/README.md for setup details.

📖 Citation

@article{wang2026openclawrl,
  title={OpenClaw-RL: Train Any Agent Simply by Talking},
  author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2603.10165},
  year={2026}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}

🙏 Acknowledgements

This work aims to explore more effective paradigms for agentic reinforcement learning. Our implementation builds on several excellent codebases, including slime, OpenClaw, Tinker, and Open-AgentRL.

We also build terminal RL on SETA's dataset and agent framework; GUI RL on OSWorld's evaluation scripts; SWE RL on mini-swe-agent's evaluation scripts; and tool-call RL on the work of Retool.

We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly advanced our research.

⚠️ Reminder

When using OpenClaw-RL, please do not share sensitive personal information in your conversations with the model. Also, keep your API keys safe and never expose them in prompts, logs, or shared files.

About

OpenClaw-RL: Train any agent simply by talking

 [arxiv.org/abs/2603.10165](https://arxiv.org/abs/2603.10165)

Topics

async gui-application coding slime tinker memory-systems skill-learning rlhf sglang grpo on-policy-distillation openclaw-skills open-claw

Resources

Readme

License

Apache-2.0 license


Stars: 3.1k · Watchers: 24 · Forks: 289


GitHub - Gen-Verse/OpenClaw-RL: OpenClaw-RL: Train any agent simply by talking · GitHub

GitHub - Gen-Verse/OpenClaw-RL: OpenClaw-RL: Train any agent simply by talking · GitHub

Navigation Menu

Search or jump to...

导航菜单

搜索或跳转到...

Search code, repositories, users, issues, pull requests...

搜索代码、仓库、用户、议题、拉取请求...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Cancel Submit feedback

提供反馈

我们会阅读每一条反馈,并非常认真地对待你的意见。

包含我的电子邮箱地址,以便联系我

取消 提交反馈

Saved searches

已保存的搜索

Use saved searches to filter your results more quickly

Name

Query

To see all available qualifiers, see our documentation.

Cancel Create saved search

  Appearance settings

Resetting focus

You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.     Dismiss alert

{{ message }}

 [Gen-Verse](https://github.com/Gen-Verse)  /  [OpenClaw-RL](https://github.com/Gen-Verse/OpenClaw-RL)  Public

Star 3.1k

使用已保存的搜索更快地筛选结果

名称

查询

要查看所有可用限定符,请参阅我们的文档

取消 创建已保存的搜索

注册

  外观设置

正在重置焦点

你已在另一个标签页或窗口登录。重新加载以刷新会话。你已在另一个标签页或窗口退出。重新加载以刷新会话。你在另一个标签页或窗口切换了账号。重新加载以刷新会话。     关闭提示

{{ message }}

 [Gen-Verse](https://github.com/Gen-Verse)  /  [OpenClaw-RL](https://github.com/Gen-Verse/OpenClaw-RL)  公开

Star 3.1k

Gen-Verse/OpenClaw-RL

main

BranchesTags

Go to file

Code

Open more actions menu

Gen-Verse/OpenClaw-RL

main

分支标签

转到文件

代码

打开更多操作菜单

Folders and files

NameName

Last commit message

Last commit date

文件夹与文件

名称名称

上次提交信息

上次提交日期

Latest commit

最新提交

Repository files navigation

仓库文件导航

OpenClaw-RL

Empowering OpenClaw with RL — Train a personalized agent simply by talking to it.

Scalable RL in real-world settings — Agentic RL for terminal, GUI, SWE, and tool-call settings.

 demo.mp4

OpenClaw-RL

以强化学习赋能 OpenClaw——只需与它对话,即可训练专属智能体。

面向真实世界的可扩展强化学习——用于终端、GUI、SWE 与工具调用场景的智能体强化学习。

 demo.mp4

📰 News

  • [2026/3/13] 🚀 OpenClaw-RL now supports both local GPU and cloud (Tinker) deployment. Launch with one line of code — Hybrid RL, OPD, and Binary RL all supported!

  • [2026/3/12] 🔥 We support LoRA training now!

  • [2026/3/10] 🔥 We have released our Technical Report! 🏆 Ranked #1 on HuggingFace Daily Papers!

  • [2026/3/10] 🔥 Huge updates today! We released a new combination method, along with an interesting evaluation of these OpenClaw-RL methods. Track 2 is released too, featuring scalable RL implementations for general agent settings across terminal, GUI, SWE, and tool-call scenarios. We only focus on real-world settings!

  • [2026/3/3] 🙌 Working with the authors of SDFT and SDPO, we have integrated their methods into openclaw-opd. We welcome the integration of novel and effective methods!

  • [2026/3/3] 📺 Check out these community tutorial videos on OpenClaw-RL: Video 1 | Video 2

  • [2026/2/26] 🔥 We release OpenClaw-RL v1 — a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.

📰 新闻

  • [2026/3/13] 🚀 OpenClaw-RL 现已同时支持本地 GPU 与云端(Tinker)部署。用一行代码即可启动——支持 Hybrid RL、OPD 与 Binary RL!

  • [2026/3/12] 🔥 我们现在支持 LoRA 训练了!

  • [2026/3/10] 🔥 我们发布了技术报告!🏆 在 HuggingFace Daily Papers 上排名 #1!

  • [2026/3/10] 🔥 今天有重大更新!我们发布了新的组合方法,并对这些 OpenClaw-RL 方法做了一项有趣的评测。Track 2 也已发布,提供面向通用智能体场景的可扩展 RL 实现,覆盖终端GUISWE工具调用等场景。我们只专注真实世界设置!

  • [2026/3/3] 🙌 我们与 SDFTSDPO 的作者合作,将其方法集成到 openclaw-opd 中。欢迎集成更多新颖而有效的方法!

  • [2026/3/3] 📺 看看社区制作的 OpenClaw-RL 教程视频:视频 1 | 视频 2

  • [2026/2/26] 🔥 我们发布 OpenClaw-RL v1——一个完全异步的 RL 框架,可从自然对话反馈中训练个性化 AI 智能体。

💡 TL;DR

OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and supports training general agents with large-scale environment parallelization.

Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background — all without interrupting your usage.

Highlights: Fully async 4-component loop · Self-hosted private · Zero manual labeling · Three learning paradigms (Binary RL / OPD / Combine) · Personal + General agent support

🌈 Features

Fully Asynchronous 4-Component Architecture

OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.

Self-Hosted Private by Design

The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.

From Feedback to Gradient — Automatically

You do not need to manually label data. The system automatically:

  • Organizes multi-turn interactions into session-aware training trajectories

  • Classifies API messages into main-line (trainable) vs. side (non-trainable) turns

  • Uses the next user, environment, or tool feedback as a natural "next-state" signal

  • Runs PRM/judge evaluation asynchronously, with majority voting when needed for more robust scoring

  • Submits ready samples to the trainer as they become available

Three Optimization Methods in One Framework

Binary RL (GRPO): A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.

On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.

Combination Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.

From Personal Agents to Real-World Agentic RL

The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.

💡 TL;DR

OpenClaw-RL 是一个完全异步的强化学习框架,它将日常对话转化为训练信号,用于打造个性化 AI 智能体;同时也支持通过大规模环境并行来训练通用智能体。

多数面向 LLM 的 RL 系统假设采用集中式、批处理训练,并基于预先收集的数据集。OpenClaw-RL 采取了根本不同的路径:它将你自托管的模型封装进 OpenClaw 并作为与 OpenAI 兼容的 API,对实时多轮对话进行拦截,并在后台持续优化策略——全程不打断你的使用体验。

亮点:完全异步的 4 组件循环 · 自托管、隐私优先 · 零人工标注 · 三种学习范式(Binary RL / OPD / Combine)· 同时支持个人与通用智能体

🌈 特性

完全异步的 4 组件架构

OpenClaw-RL 将智能体服务、rollout 收集、PRM/裁判评估、策略训练解耦为独立的异步循环。它们彼此不阻塞:模型在后台训练的同时仍可持续提供服务;评估也会与新的交互并发进行。

自托管,隐私优先的设计

整个栈(包括策略模型、judge/PRM 以及训练器)都运行在你自己的基础设施上。对话数据留在你的系统内,不需要任何第三方模型 API。

从反馈到梯度——全自动

你无需手动标注数据。系统会自动:

  • 将多轮交互整理为具备会话语境的训练轨迹

  • 将 API 消息分类为主线(可训练)与支线(不可训练)的轮次

  • 将下一步来自用户、环境或工具的反馈用作自然的“下一状态”信号

  • 异步运行 PRM/judge 评估;在需要时用多数投票提升评分稳健性

  • 一旦样本就绪,即提交给训练器进行训练

一个框架内集成三种优化方法

Binary RL(GRPO):过程奖励模型(Process Reward Model)会基于下一状态反馈为每个轮次打分。随后使用 GRPO 的优势估计与 PPO 风格的裁剪代理损失(clipped surrogate loss)来利用该标量奖励。

On-Policy Distillation(OPD):当下一状态揭示有价值的事后信息时,裁判模型会提取一条文本提示(hint)。该提示会与原始提示词合并,形成增强后的教师(teacher);教师与学生在 token 级对数概率上的差距将转化为方向性的优势信号,其信息量比任何标量奖励都更丰富。

组合方法:OpenClaw-RL 将 Binary RL 与 OPD 进一步融合为统一的训练配方,既利用 Binary RL 的稠密标量监督,也引入 OPD 更丰富的 token 级方向信号。该组合相较任一方法单独使用,可实现更强、更稳健的优化效果。

从个人智能体到真实世界的智能体强化学习

同一套框架既支持个性化的 OpenClaw 优化,也支持面向真实世界设置的终端、GUI、SWE 与工具调用智能体的可扩展 RL。

🎯 Roadmap

Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:

Track 1 — Personal Agent Optimization (Small-Scale but Personal)

✅ Release Track 1: Fully async OpenClaw-RL framework with Binary RL + OPD ✅ Best recipe discovery via demonstration experiments ✅ Support LoRA Training ✅ Deploy training on Tinker ⬜ Support low-precision training/inference ⬜ Beyond the policy: extend learning to skills and memory

Track 2 — General Agents Optimization (Scalable Infra)

✅ Release Track 2: Scalable agentic RL infra for general agents ⬜ Support more cloud services

🎯 路线图

我们的长期目标是用强化学习推进更个性化、在实践中真正有用的智能体。路线图分为两条轨道:

Track 1 — 个人智能体优化(小规模但更个人化)

✅ 发布 Track 1:完全异步的 OpenClaw-RL 框架,包含 Binary RL + OPD ✅ 通过演示实验发现最佳配方 ✅ 支持 LoRA 训练 ✅ 在 Tinker 上部署训练 ⬜ 支持低精度训练/推理 ⬜ 不止于策略:将学习扩展到技能与记忆

Track 2 — 通用智能体优化(可扩展基础设施)

✅ 发布 Track 2:面向通用智能体的可扩展智能体 RL 基础设施 ⬜ 支持更多云服务

🤝 Contributing

We welcome contributions that integrate new learning methods into the OpenClaw-RL framework! The integration of SDFT / SDPO into openclaw-opd, and supporting LoRA are great examples of successful community contributions.

Highly wanted contributions:

  • 🤖 Qwen3.5 model support with slime — launch scripts and model configs for the Qwen3.5 family

  • 🔧 Low-precision training examples — FP8/INT4 training scripts for existing methods

📋 Full contribution guidelines feature wishlist

🤝 贡献

我们欢迎将新学习方法集成进 OpenClaw-RL 框架的贡献!将 SDFT / SDPO 集成到 openclaw-opd,以及支持 LoRA,都是成功的社区贡献范例。

最希望看到的贡献:

  • 🤖 使用 slime 支持 Qwen3.5 模型——为 Qwen3.5 家族补充启动脚本与模型配置

  • 🔧 低精度训练示例——为现有方法补充 FP8/INT4 训练脚本

📋 完整贡献指南 功能愿望清单

Call for Contributions

We welcome community contributions to OpenClaw-RL! This document outlines our contribution principles and the features we'd love help with.

征集贡献

我们欢迎社区为 OpenClaw-RL 做出贡献!本文档概述我们的贡献原则,以及我们特别希望获得帮助的功能。

Contribution Guidelines

OpenClaw-RL is organized as a collection of self-contained method folders (e.g., openclaw-rl/, openclaw-opd/, openclaw-combine/), each sitting alongside the shared slime/ training framework and openclaw/ runtime.

Contributions generally fall into two categories:

Adding a new method or deployment target

Create a new top-level folder (parallel to existing ones like openclaw-opd/). All method-specific code — launch scripts, custom loss functions, rollout logic, API server adapters, data processing, and the README — should live inside this folder.

Extending an existing method

For changes within an existing method folder — such as supporting a new model family, adding a LoRA variant, or a low-precision example — add new files (e.g., a new .sh script, a new data processing script) rather than modifying existing ones. This way the original working examples stay intact and your addition can be reviewed independently.

General principles

Do not modify the core framework. Avoid changes to slime/, Megatron-LM/, or openclaw/ unless absolutely necessary. The framework exposes extension points (--custom-loss-function-path, --rollout-function-path, --custom-generate-function-path, --custom-rm-path, etc.) specifically so that new methods can plug in without touching shared code. If a framework change is truly needed, please open a separate PR for it with a clear justification.

Include documentation. For a new method folder, add a README.md explaining what the method does, how to run it, key environment variables, and file structure. For additions to existing folders, update the existing README.md with a new section. See openclaw-combine/README.md or toolcall-rl/README.md for good examples.

Follow existing conventions. Use the same shell script structure (GPU partitioning, CKPT_ARGS, ROLLOUT_ARGS, OPTIMIZER_ARGS, etc.), environment variable naming, and ray job submit launch pattern used by the existing methods.

贡献指南

OpenClaw-RL 以一组自包含的方法文件夹组织(例如 openclaw-rl/、openclaw-opd/、openclaw-combine/),它们与共享的 slime/ 训练框架和 openclaw/ 运行时并列。

贡献通常分为两类:

添加新的方法或部署目标

创建一个新的顶层文件夹(与 openclaw-opd/ 等现有文件夹并列)。所有与该方法相关的代码——启动脚本、自定义损失函数、rollout 逻辑、API 服务适配器、数据处理以及 README——都应放在此文件夹内。

扩展已有方法

对于已有方法文件夹内的改动——例如支持新的模型家族、添加 LoRA 变体,或提供低精度示例——请新增文件(例如新的 .sh 脚本、新的数据处理脚本),而不是修改已有文件。这样原本可用的示例不会被破坏,你的新增内容也能被独立审查。

通用原则

不要修改核心框架。除非绝对必要,请避免改动 slime/、Megatron-LM/ 或 openclaw/。框架提供了扩展点(--custom-loss-function-path、--rollout-function-path、--custom-generate-function-path、--custom-rm-path 等),专门用于让新方法无需触碰共享代码即可接入。如果确实需要修改框架,请单独开一个 PR,并给出清晰的理由。

包含文档。对于新的方法文件夹,请添加 README.md,说明该方法做什么、如何运行、关键环境变量与文件结构。对于对现有文件夹的新增,请在现有 README.md 中添加新章节。可参考 openclaw-combine/README.mdtoolcall-rl/README.md 的优秀示例。

遵循既有约定。使用与现有方法一致的 shell 脚本结构(GPU 分区、CKPT_ARGS、ROLLOUT_ARGS、OPTIMIZER_ARGS 等)、环境变量命名,以及 ray job submit 的启动模式。

Highly Preferred Features

1. 🤖 Qwen3.5 Model Support of slime

Type: Extend existing method folders

Goal: Add launch scripts and model configurations for the Qwen3.5 family across existing methods.

Requirements:

  • Add new .sh scripts for Qwen3.5 in relevant method folders (e.g., openclaw-combine/run_qwen35_4b_openclaw_combine.sh).

  • Add the corresponding model config in slime/scripts/models/ if Qwen3.5 requires different architecture parameters (hidden size, num layers, etc.) from Qwen3.

  • Verify and document any changes needed for tokenizer, chat template, reasoning parser, or tool-call parser compatibility.

  • Update READMEs to list Qwen3.5 as a supported model.

2. 🔧 Low-Precision Training/Inference Examples

Type: Extend existing method folders

Goal: Add low-precision (e.g., INT8/INT4 inference, BF16/FP8 training) example scripts to existing method folders, enabling users to run OpenClaw-RL on consumer-grade hardware with fewer GPUs.

Requirements:

  • Add new .sh scripts within existing method folders — do not modify existing scripts.

  • Low-precision inference: demonstrate launching the SGLang rollout engine with quantized weights (e.g., AWQ/GPTQ INT4) to reduce VRAM for the serving side.

  • Low-precision training: if supported by the Megatron backend, demonstrate FP8 or mixed-precision configurations that reduce training memory.

  • Update the corresponding README.md in each method folder with a new section documenting these scripts.

If you're interested in any of these, feel free to open an issue to discuss your approach before submitting a PR. We're happy to provide guidance and review!

高度优先的功能

1. 🤖 为 slime 支持 Qwen3.5 模型

类型:扩展现有方法文件夹

目标:在现有方法中为 Qwen3.5 家族补齐启动脚本与模型配置。

要求:

  • 在相关方法文件夹中新增 Qwen3.5 的 .sh 脚本(例如 openclaw-combine/run_qwen35_4b_openclaw_combine.sh)。

  • 若 Qwen3.5 相比 Qwen3 需要不同的架构参数(hidden size、num layers 等),请在 slime/scripts/models/ 中新增对应模型配置。

  • 验证并记录 tokenizer、chat template、reasoning parser 或 tool-call parser 的兼容性调整需求。

  • 更新各 README,将 Qwen3.5 列为受支持模型。

2. 🔧 低精度训练/推理示例

类型:扩展现有方法文件夹

目标:为现有方法补充低精度(例如 INT8/INT4 推理、BF16/FP8 训练)示例脚本,让用户能在更少 GPU 的消费级硬件上运行 OpenClaw-RL。

要求:

  • 在现有方法文件夹内新增 .sh 脚本——不要修改已有脚本。

  • 低精度推理:演示以量化权重(例如 AWQ/GPTQ INT4)启动 SGLang rollout 引擎,从而降低服务侧的显存占用。

  • 低精度训练:若 Megatron 后端支持,演示 FP8 或混合精度配置,以降低训练内存。

  • 在每个方法文件夹对应的 README.md 中新增章节,说明这些脚本。

如果你对其中任何一项感兴趣,欢迎先开一个 issue 讨论方案再提交 PR。我们很乐意提供指导并进行评审!

🔧 Personal Agent Optimization Quick Start

1. Deployment Options

Don't have any money?

  • Hardware: 8× GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)

  • Software: CUDA 12.9, Python 3.12

  • Framework: Slime (our base RL framework)

For detailed environment setup, see Slime or ./instructions/README.md.

Don't have a GPU?

Create a Tinker API. That's all you need. But note that Tinker only supports LoRA, which may not be as effective as full fine-tuning. So we are still testing it.

2. Start the RL Server

We provide three methods (RL servers):

Dimension [Binary RL](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-rl) [OPD](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-opd) [Combined](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-combine)     Signal type Evaluative (good / bad) Directional Evaluative + directional   Advantage Sequence-level scalar Token-level directional Mixed sequence and token-level   Density All scored turns Hint-accepted turns only All scored turns   Feedback type User / environment Explicit corrections Both implicit and explicit feedback   Signal richness 1 scalar per sample 1 value per token 1 value per token

Choose your optimization method:

Option A: Combination Method — Recommended !

cd slime
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh

This method combines binary RL and OPD to achieve the best optimization.

See ./openclaw-combine/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_lora.sh

With Tinker (No GPUs at all)

cd openclaw-tinker
python run.py --method combine --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 1 --w-opd 1.0 --w-rl 1.0

see ./openclaw-tinker/README.md for setup details.

Option B: Binary RL — Best for implicit feedback (likes/dislikes, env success/failure)

cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

The PRM will automatically judge response quality from next-state feedback. We recommend providing frequent feedback (e.g., 👍/👎) to help the model optimize effectively.

See ./openclaw-rl/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-rl/run_qwen3_4b_openclaw_rl_lora.sh

With Tinker (No GPUs at all)

cd openclaw-tinker
python run.py --method rl --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 3

see ./openclaw-tinker/README.md for setup details.

Option C: On-Policy Distillation (OPD) — Best for rich textual feedback

cd slime
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh

The system extracts hindsight hints from your feedback and distills them into the policy at the token level. We recommend providing concrete feedback (e.g., "you should have checked the file first" or "don't use that library").

See ./openclaw-opd/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-opd/run_qwen3_4b_openclaw_opd_topk_lora.sh

With Tinker (No GPUs at all)

cd openclaw-tinker
python run.py --method opd --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 1

see ./openclaw-tinker/README.md for setup details.

Once running, the model is served as an OpenAI-compatible API at:

http://HOST_IP:30000/v1

where HOST_IP is the IP address of the machine running the RL server (e.g. 115.190.98.251). The port 30000 is the default and can be changed via the PORT environment variable.

Take note of this endpoint — you will need it when configuring OpenClaw in the next step.

We also provide an interesting case for evaluation. A student who uses OpenClaw to do homework, does not want to be found using AI. A teacher who also uses OpenClaw to grade student's homework, wants the comments to be specific and friendly.

Evaluation Setting — Both student and teacher use AI!

We find that, under the combined optimization method, OpenClaw needs only 36 problem-solving interactions in the student setting and 24 grading interactions in the teacher setting to show a clearly visible improvement.

See ./openclaw-test/README.md for setup and algorithm details.

3. OpenClaw Setup

Install OpenClaw from the version bundled in this repository (we will update it regularly).

If you want local file-backed skill authoring in the bundled OpenClaw runtime, see openclaw/extensions/skill-bridge/README.md.

Then configure OpenClaw to route requests to your RL server.

Open your openclaw.json (or the equivalent settings file) and add a provider entry under "models" → "providers":

Example of Slime-based RL server:

{
  "models": {
    "providers": {
      "qwen": {
        "baseUrl": "http://HOST_IP:30000/v1",
        "apiKey": "apiKey",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b",
            "name": "Qwen3 4B",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

Replace HOST_IP with the IP address of your RL server machine. The apiKey should match the SGLANG_API_KEY you set when starting the server.

Example of Tinker-based RL server:

{
  "models": {
    "providers": {
      "openclaw-rl": {
        "baseUrl": "http://localhost:30000/v1",
        "apiKey": "no-auth-needed",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b-lora",
            "name": "Qwen3 4B (OpenClaw-RL LoRA)",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

That's it — start chatting with your OpenClaw agent. The RL server will automatically collect conversation trajectories, compute rewards, and train the model. Your agent gets better the more you use it.


🔧 Agentic RL in Real-world Settings

The same asynchronous RL backbone that powers our personal-agent setting can also support large-scale optimization for these broader real-world environments.

| Setting | Environment | Next-state signal | Horizon |
| --- | --- | --- | --- |
| Terminal | Shell execution sandbox | stdout/stderr, exit code | Long |
| GUI | Screen state + accessibility tree | Visual state diff, task progress | Long |
| SWE | Code repository + test suite | Test verdicts, diff, lint output | Long |
| Tool-call | API/function execution | Return values, error traces | Medium |
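For the terminal setting, for instance, the next-state signal can be collapsed into a scalar reward. A hand-written sketch (illustrative only — the actual pipeline scores turns with a learned PRM/judge, and the thresholds below are assumptions):

```python
def terminal_reward(exit_code: int, stderr: str) -> float:
    """Collapse a terminal next-state signal into a scalar reward.

    Illustrative only: the real pipeline scores turns with a learned
    PRM/judge rather than a hand-written rule like this one.
    """
    if exit_code == 0 and not stderr.strip():
        return 1.0   # clean success
    if exit_code == 0:
        return 0.5   # succeeded, but wrote warnings to stderr
    return 0.0       # non-zero exit code: failure

print(terminal_reward(0, ""))       # → 1.0
print(terminal_reward(2, "boom"))   # → 0.0
```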

🖥️ Terminal Agent — the most widely used computer-use agent

cd slime
bash ../terminal-rl/terminal_qwen3_8b_rl.sh

See ./terminal-rl/README.md for setup details.

📟 GUI Agent — the most general computer-use agent

cd slime
bash ../gui-rl/gui_qwen3vl_8b_rl.sh

See ./gui-rl/README.md for setup details.

👨‍💻 SWE Agent — software engineering agent

cd slime
bash ../swe-rl/run_swe_rl_32b_remote_8nodes.sh

See ./swe-rl/README.md for setup details.

🛠️ Tool-call Agent — the most practical agent

cd slime
bash ../toolcall-rl/retool_qwen3_4b_rl.sh

See ./toolcall-rl/README.md for setup details.


📖 Citation

@article{wang2026openclawrl,
  title={OpenClaw-RL: Train Any Agent Simply by Talking},
  author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2603.10165},
  year={2026}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}


🙏 Acknowledgements

This work aims to explore more effective paradigms for Agentic RL. Our implementation builds upon the excellent codebases of slime, OpenClaw, Tinker and Open-AgentRL.

We also build terminal RL using SETA's dataset and agent framework, GUI RL using OSWorld's evaluation scripts, SWE RL using mini-swe-agent's evaluation scripts, and tool-call RL based on the work of Retool.

We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.


⚠️ Reminder

When using OpenClaw-RL, please do not provide sensitive personal information during conversations with the model. Also, make sure to keep your API keys secure and never expose them in prompts, logs, or shared files.


About

OpenClaw-RL: Train any agent simply by talking

 [arxiv.org/abs/2603.10165](https://arxiv.org/abs/2603.10165)

Topics

async gui-application coding slime tinker memory-systems skill-learning rlhf sglang grpo on-policy-distillation openclaw-skills open-claw

Resources

Readme

License

Apache-2.0 license


Activity

Custom properties

Stars

3.1k stars

Watchers

24 watching

Forks

289 forks

Report repository


Releases

No releases published


[Packages 0](https://github.com/orgs/Gen-Verse/packages?repo_name=OpenClaw-RL)



2026 GitHub, Inc.


OpenClaw-RL

Empowering OpenClaw with RL — Train a personalized agent simply by talking to it.

Scalable RL in real-world settings — Agentic RL for terminal, GUI, SWE, and tool-call settings.

 demo.mp4

📰 News

  • [2026/3/13] 🚀 OpenClaw-RL now supports both local GPU and cloud (Tinker) deployment. Launch with one line of code — Hybrid RL, OPD, and Binary RL all supported!

  • [2026/3/12] 🔥 We support LoRA training now!

  • [2026/3/10] 🔥 We have released our Technical Report! 🏆 Ranked #1 on HuggingFace Daily Papers!

  • [2026/3/10] 🔥 Huge updates today! We released a new combination method, along with an interesting evaluation of these OpenClaw-RL methods. Track 2 is released too, featuring scalable RL implementations for general agent settings across terminal, GUI, SWE, and tool-call scenarios. We only focus on real-world settings!

  • [2026/3/3] 🙌 Working with the authors of SDFT and SDPO, we have integrated their methods into openclaw-opd. We welcome the integration of novel and effective methods!

  • [2026/3/3] 📺 Check out these community tutorial videos on OpenClaw-RL: Video 1 | Video 2

  • [2026/2/26] 🔥 We release OpenClaw-RL v1 — a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.

💡 TL;DR

OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and supports training general agents with large-scale environment parallelization.

Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background — all without interrupting your usage.

Highlights: Fully async 4-component loop · Self-hosted private · Zero manual labeling · Three learning paradigms (Binary RL / OPD / Combine) · Personal + General agent support

🌈 Features

Fully Asynchronous 4-Component Architecture

OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.
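A toy sketch of this decoupling using asyncio queues (the real system runs the four components as separate processes, including a serving loop omitted here; the names and the three-item workload are illustrative):

```python
import asyncio

# Minimal sketch of decoupled loops (rollout collection, PRM judging,
# training) communicating through queues. Illustrative names only.

async def collector(rollouts: asyncio.Queue):
    for i in range(3):                      # stand-in for live conversations
        await rollouts.put(f"trajectory-{i}")

async def judge(rollouts: asyncio.Queue, scored: asyncio.Queue):
    for _ in range(3):
        traj = await rollouts.get()
        await scored.put((traj, 1.0))       # stand-in for PRM scoring

async def trainer(scored: asyncio.Queue, log: list):
    for _ in range(3):
        traj, reward = await scored.get()
        log.append((traj, reward))          # stand-in for a gradient step

async def main():
    rollouts, scored, log = asyncio.Queue(), asyncio.Queue(), []
    # All loops run concurrently; none blocks the others.
    await asyncio.gather(collector(rollouts), judge(rollouts, scored),
                         trainer(scored, log))
    return log

log = asyncio.run(main())
print(len(log))  # → 3
```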

Self-Hosted Private by Design

The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.

From Feedback to Gradient — Automatically

You do not need to manually label data. The system automatically:

  • Organizes multi-turn interactions into session-aware training trajectories

  • Classifies API messages into main-line (trainable) vs. side (non-trainable) turns

  • Uses the next user, environment, or tool feedback as a natural "next-state" signal

  • Runs PRM/judge evaluation asynchronously, with majority voting when needed for more robust scoring

  • Submits ready samples to the trainer as they become available
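The majority-voting step above can be sketched in a few lines, assuming ±1 verdicts (whether the `--prm-m` flag in the launch commands sets the number of judge samples is our assumption):

```python
from collections import Counter

def majority_vote(scores):
    """Aggregate several independent judge verdicts (e.g. +1 good / -1 bad)
    into one robust label. Plausibly what --prm-m controls (number of
    judge samples), though that mapping is an assumption."""
    label, _ = Counter(scores).most_common(1)[0]
    return label

print(majority_vote([1, 1, -1]))  # → 1
```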

Three Optimization Methods in One Framework

Binary RL (GRPO): A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.
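The two ingredients named here — group-normalized advantages and a clipped surrogate — can be sketched as follows (a toy illustration of the standard formulas, not the repo's implementation):

```python
import math

def grpo_advantages(rewards):
    """GRPO advantage: normalize each reward against its group's mean/std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """PPO-style clipped surrogate objective for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * adv, clipped * adv)

advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advs[0] > 0 > advs[1])  # → True
```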

On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.
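The token-level signal can be sketched as a per-token log-probability gap (the numbers are made up; the real pipeline computes these from the hint-augmented teacher's and the student's actual distributions):

```python
def opd_token_advantages(teacher_logps, student_logps):
    """Per-token directional signal: log-probability gap between the
    hint-augmented teacher and the student, one value per token.
    Positive values mean the hindsight-informed teacher prefers that
    token more strongly than the student did."""
    return [t - s for t, s in zip(teacher_logps, student_logps)]

# Made-up numbers: the teacher saw the hint "check the file first",
# so it puts more probability on the corrective tokens.
gaps = opd_token_advantages([-0.1, -0.5, -0.2], [-1.2, -0.6, -2.0])
print([round(g, 1) for g in gaps])  # → [1.1, 0.1, 1.8]
```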

Combination Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.
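Assuming the combination is a weighted sum of the two objectives (the `--w-rl`/`--w-opd` flags of the Tinker launcher suggest this, but the exact recipe is not specified here):

```python
def combined_loss(rl_loss, opd_loss, w_rl=1.0, w_opd=1.0):
    """Weighted sum of the Binary RL and OPD objectives. The weights mirror
    the --w-rl / --w-opd launcher flags; the simple weighted sum is an
    assumption about the actual recipe."""
    return w_rl * rl_loss + w_opd * opd_loss

print(combined_loss(0.5, 0.25))  # → 0.75
```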

From Personal Agents to Real-World Agentic RL

The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.

🎯 Roadmap

Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:

Track 1 — Personal Agent Optimization (Small-Scale but Personal)

  • ✅ Release Track 1: Fully async OpenClaw-RL framework with Binary RL + OPD

  • ✅ Best recipe discovery via demonstration experiments

  • ✅ Support LoRA training

  • ✅ Deploy training on Tinker

  • ⬜ Support low-precision training/inference

  • ⬜ Beyond the policy: extend learning to skills and memory

Track 2 — General Agents Optimization (Scalable Infra)

  • ✅ Release Track 2: Scalable agentic RL infra for general agents

  • ⬜ Support more cloud services

🤝 Contributing

We welcome contributions that integrate new learning methods into the OpenClaw-RL framework! The integration of SDFT/SDPO into openclaw-opd and the addition of LoRA support are great examples of successful community contributions.

Highly wanted contributions:

  • 🤖 Qwen3.5 model support with slime — launch scripts and model configs for the Qwen3.5 family

  • 🔧 Low-precision training examples — FP8/INT4 training scripts for existing methods

📋 Full contribution guidelines & feature wishlist:

Call for Contributions

We welcome community contributions to OpenClaw-RL! This document outlines our contribution principles and the features we'd love help with.

Contribution Guidelines

OpenClaw-RL is organized as a collection of self-contained method folders (e.g., openclaw-rl/, openclaw-opd/, openclaw-combine/), each sitting alongside the shared slime/ training framework and openclaw/ runtime.

Contributions generally fall into two categories:

Adding a new method or deployment target

Create a new top-level folder (parallel to existing ones like openclaw-opd/). All method-specific code — launch scripts, custom loss functions, rollout logic, API server adapters, data processing, and the README — should live inside this folder.

Extending an existing method

For changes within an existing method folder — such as supporting a new model family, adding a LoRA variant, or a low-precision example — add new files (e.g., a new .sh script, a new data processing script) rather than modifying existing ones. This way the original working examples stay intact and your addition can be reviewed independently.

General principles

Do not modify the core framework. Avoid changes to slime/, Megatron-LM/, or openclaw/ unless absolutely necessary. The framework exposes extension points (--custom-loss-function-path, --rollout-function-path, --custom-generate-function-path, --custom-rm-path, etc.) specifically so that new methods can plug in without touching shared code. If a framework change is truly needed, please open a separate PR for it with a clear justification.
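As a sketch of what such a plug-in might look like (the `--custom-loss-function-path` flag name comes from the text above, but the entry-point name, signature, and batch layout below are hypothetical; check slime's extension-point documentation for the real interface):

```python
# Hypothetical plug-in module, loaded via --custom-loss-function-path.
# The entry-point name, signature, and batch layout are illustrative
# guesses, NOT slime's documented interface.

def _mean(xs):
    return sum(xs) / len(xs)

def compute_loss(batch: dict, config: dict) -> float:
    """Toy combined objective over per-token lists in `batch`:
    an advantage-weighted policy term plus a teacher-student gap term."""
    rl_term = _mean([a * lp for a, lp in
                     zip(batch["advantages"], batch["logps"])])
    opd_term = _mean([t - s for t, s in
                      zip(batch["teacher_logps"], batch["logps"])])
    return -(config.get("w_rl", 1.0) * rl_term
             + config.get("w_opd", 1.0) * opd_term)

batch = {"advantages": [1.0, -1.0], "logps": [-0.5, -2.0],
         "teacher_logps": [-0.25, -1.0]}
print(compute_loss(batch, {"w_rl": 1.0, "w_opd": 1.0}))  # → -1.375
```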

Include documentation. For a new method folder, add a README.md explaining what the method does, how to run it, key environment variables, and file structure. For additions to existing folders, update the existing README.md with a new section. See openclaw-combine/README.md or toolcall-rl/README.md for good examples.

Follow existing conventions. Use the same shell script structure (GPU partitioning, CKPT_ARGS, ROLLOUT_ARGS, OPTIMIZER_ARGS, etc.), environment variable naming, and ray job submit launch pattern used by the existing methods.

Highly Preferred Features

1. 🤖 Qwen3.5 Model Support in slime

Type: Extend existing method folders

Goal: Add launch scripts and model configurations for the Qwen3.5 family across existing methods.

Requirements:

  • Add new .sh scripts for Qwen3.5 in relevant method folders (e.g., openclaw-combine/run_qwen35_4b_openclaw_combine.sh).

  • Add the corresponding model config in slime/scripts/models/ if Qwen3.5 requires different architecture parameters (hidden size, num layers, etc.) from Qwen3.

  • Verify and document any changes needed for tokenizer, chat template, reasoning parser, or tool-call parser compatibility.

  • Update READMEs to list Qwen3.5 as a supported model.

2. 🔧 Low-Precision Training/Inference Examples

Type: Extend existing method folders

Goal: Add low-precision (e.g., INT8/INT4 inference, BF16/FP8 training) example scripts to existing method folders, enabling users to run OpenClaw-RL on consumer-grade hardware with fewer GPUs.

Requirements:

  • Add new .sh scripts within existing method folders — do not modify existing scripts.

  • Low-precision inference: demonstrate launching the SGLang rollout engine with quantized weights (e.g., AWQ/GPTQ INT4) to reduce VRAM for the serving side.

  • Low-precision training: if supported by the Megatron backend, demonstrate FP8 or mixed-precision configurations that reduce training memory.

  • Update the corresponding README.md in each method folder with a new section documenting these scripts.

If you're interested in any of these, feel free to open an issue to discuss your approach before submitting a PR. We're happy to provide guidance and review!


🔧 Personal Agent Optimization Quick Start

1. Deployment Options

Don't have any money?

  • Hardware: 8× GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)

  • Software: CUDA 12.9, Python 3.12

  • Framework: Slime (our base RL framework)

For detailed environment setup, see Slime or ./instructions/README.md.
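For example, the GPU partition can be expressed through the environment variables listed above before launching (the 4/2/2 split is an illustrative assumption, not a documented default):

```shell
# Illustrative partition of the default 8 GPUs across the async components.
# The variable names come from this guide; the 4/2/2 split is an assumption.
export NUM_GPUS=8
export ACTOR_GPUS=4      # policy trainer
export ROLLOUT_GPUS=2    # rollout / serving engine
export PRM_GPUS=2        # PRM / judge
echo "actor=$ACTOR_GPUS rollout=$ROLLOUT_GPUS prm=$PRM_GPUS"
# Then launch as usual, e.g.:
#   cd slime && bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh
```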

Don't have a GPU?

Create a Tinker API key — that's all you need. Note, however, that Tinker only supports LoRA, which may be less effective than full fine-tuning; we are still testing it.

2. Start the RL Server

We provide three methods (RL servers):

| Dimension | [Binary RL](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-rl) | [OPD](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-opd) | [Combined](https://github.com/Gen-Verse/OpenClaw-RL/blob/main/openclaw-combine) |
| --- | --- | --- | --- |
| Signal type | Evaluative (good / bad) | Directional | Evaluative + directional |
| Advantage | Sequence-level scalar | Token-level directional | Mixed sequence- and token-level |
| Density | All scored turns | Hint-accepted turns only | All scored turns |
| Feedback type | User / environment | Explicit corrections | Both implicit and explicit feedback |
| Signal richness | 1 scalar per sample | 1 value per token | 1 value per token |

Choose your optimization method:

Option A: Combination Method — Recommended!

cd slime
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh

This method combines binary RL and OPD to achieve the best optimization.

See ./openclaw-combine/README.md for algorithm details.

With LoRA (parameter-efficient, fewer GPUs):

bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_lora.sh

With Tinker (no GPUs at all):

cd openclaw-tinker
python run.py --method combine --model-name Qwen/Qwen3-8B --batch-size 16 --prm-m 1 --w-opd 1.0 --w-rl 1.0

See ./openclaw-tinker/README.md for setup details.

📋 Discussion Archive

Discussion in progress…