mirror of
https://github.com/langbot-app/LangBot.git
synced 2026-06-11 00:06:04 +00:00
Merge remote-tracking branch 'origin/master' into refactor/eba
# Conflicts: # pyproject.toml # uv.lock
This commit is contained in:
595
docs/review/box-architecture.md
Normal file
595
docs/review/box-architecture.md
Normal file
@@ -0,0 +1,595 @@
|
||||
# Box 系统架构深度分析
|
||||
|
||||
> 更新日期: 2026-06-02
|
||||
> 状态更新: 自部署社区版已具备发布条件(box 可选、降级完善、无迁移欠债);工具调用循环上限、配额遍历异步化、`host_path` 挂载白名单等已落地。剩余多租户 / 安全硬化项见 [SaaS 阻塞项清单](./box-issues.md)。
|
||||
> 分支: `feat/sandbox` (LangBot + langbot-plugin-sdk)
|
||||
> 相关文档: [SaaS 阻塞项](./box-issues.md) | [Session 作用域](./box-session-scope.md) | [Runtime 对比](./box-vs-plugin-runtime.md) | [测试覆盖](./box-test-coverage.md) | [toB 分析](./box-tob-analysis.md)
|
||||
|
||||
---
|
||||
|
||||
## 1. 全局架构
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ LangBot 主进程 │
|
||||
│ │
|
||||
│ LocalAgentRunner ──> ToolManager ──> NativeToolLoader │
|
||||
│ │ │ │ │
|
||||
│ │ │ exec / read / write / edit │
|
||||
│ │ │ glob / grep │
|
||||
│ │ │ │
|
||||
│ │ ├──> MCPLoader ──> BoxStdioSession │
|
||||
│ │ │ (shared 容器, 多 process) │
|
||||
│ │ │ │
|
||||
│ │ ├──> SkillToolLoader (activate 工具) │
|
||||
│ │ │ │
|
||||
│ │ ├──> SkillAuthoringToolLoader │
|
||||
│ │ │ │
|
||||
│ │ └──> PluginToolLoader │
|
||||
│ │ │
|
||||
│ BoxService (门面) │
|
||||
│ ├─ Profile 管理 (locked 字段) │
|
||||
│ ├─ Host mount 校验 (allowed_mount_roots) │
|
||||
│ ├─ Workspace quota 检查 │
|
||||
│ ├─ 输出截断 (head+tail) │
|
||||
│ ├─ Session ID 模板解析 (resolve_box_session_id) │
|
||||
│ ├─ 技能挂载组装 (build_skill_extra_mounts) │
|
||||
│ ├─ 重连循环 (_reconnect_loop, 指数退避) │
|
||||
│ └─ BoxRuntimeConnector │
|
||||
│ ├─ 心跳 loop (20s ping) │
|
||||
│ └─ ActionRPCBoxClient │
|
||||
│ │ Action RPC (stdio 或 WebSocket) │
|
||||
│ │
|
||||
│ SkillManager (skill_mgr) │
|
||||
│ └─ 从 Box runtime 拉取 skills, 不可用时回落 data/skills │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ Box Runtime 进程 (SDK 侧) │
|
||||
│ │
|
||||
│ BoxServerHandler (Action RPC 处理, INIT 配置注入) │
|
||||
│ │ │
|
||||
│ BoxRuntime (session 管理 / 进程生命周期 / TTL reaper) │
|
||||
│ │ └─ session.managed_processes: dict[pid, _ManagedProcess]
|
||||
│ │ │
|
||||
│ Backend (启动时根据 box.backend 配置选择): │
|
||||
│ DockerBackend ──┐ │
|
||||
│ PodmanBackend ──┤── CLISandboxBackend │
|
||||
│ NsjailBackend ──┘ (本地 CLI 或 fallback 到容器内 CLI) │
|
||||
│ E2BBackend (云沙箱, 需要 E2B_API_KEY) │
|
||||
│ │
|
||||
│ BoxSkillStore │
|
||||
│ ├─ list / get / create / update / delete │
|
||||
│ ├─ scan_skill_directory / read_skill_file / write_skill_file │
|
||||
│ └─ preview_skill_zip / install_skill_zip (zip 或 GitHub) │
|
||||
│ │
|
||||
│ aiohttp 单端口服务 (默认 :5410): │
|
||||
│ /rpc/ws — Action RPC │
|
||||
│ /v1/sessions/{id}/managed-process/ws — 默认 process │
|
||||
│ /v1/sessions/{id}/managed-process/{pid}/ws — 指定 process │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ 容器 / 沙箱 (Docker/Podman 容器, nsjail sandbox, 或 E2B 远程沙箱) │
|
||||
│ - 隔离文件系统 / 网络 / PID 命名空间 │
|
||||
│ - 资源限制 (CPU, 内存, PID 数, 可选 workspace 配额) │
|
||||
│ - 主挂载 (host_path → mount_path) + 任意条 extra_mounts │
|
||||
│ └─ Skills 通过 extra_mounts 挂在 /workspace/.skills/<name> │
|
||||
│ - exec: 用户命令在此执行 │
|
||||
│ - managed process: 多个长驻进程并存 (MCP Server / 自定义服务) │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**核心设计原则**:
|
||||
- Box Runtime 作为独立进程运行,通过 Action RPC 与 LangBot 主进程通信,两者复用 SDK 的 IO 层(Handler → Connection → Controller)
|
||||
- 一个 session_id 对应一个容器/沙箱实例。同一 session 内可并存多条 mount 与多个 managed process
|
||||
- Skill / 默认 exec / MCP Server 共享同一个 session 容器(详见 [box-session-scope.md](./box-session-scope.md))
|
||||
|
||||
---
|
||||
|
||||
## 2. LangBot 侧模块
|
||||
|
||||
### 2.1 BoxService (`pkg/box/service.py`, 722 行)
|
||||
|
||||
应用层门面,协调 Profile、安全校验、配额、连接、Skill 挂载与 Session 模板:
|
||||
|
||||
主要公开方法(按定义顺序):
|
||||
|
||||
```
|
||||
BoxService
|
||||
├─ initialize() 连接 Box Runtime + 默认 workspace 准备
|
||||
├─ _on_runtime_disconnect(connector) 触发重连
|
||||
├─ _reconnect_loop(connector) 指数退避重连
|
||||
├─ available (property) 连接状态
|
||||
│
|
||||
├─ resolve_box_session_id(query) 从 pipeline 模板解析 session_id
|
||||
├─ build_skill_extra_mounts(query) 组装 pipeline-bound skill 的挂载列表
|
||||
│
|
||||
├─ execute_tool(parameters, query) Agent 调用 exec 时的入口
|
||||
│ ├─ _apply_profile / build_spec
|
||||
│ ├─ _validate_host_mount
|
||||
│ ├─ _enforce_workspace_quota (phase=pre)
|
||||
│ ├─ client.execute(spec)
|
||||
│ ├─ _enforce_workspace_quota (phase=post)
|
||||
│ └─ _truncate (stdout/stderr)
|
||||
│
|
||||
├─ execute_spec_payload(spec_payload, ...) 内部入口(其他 loader 调用)
|
||||
├─ create_session(spec_payload, ...) 显式创建 session
|
||||
├─ start_managed_process(session_id, ...) 启动 managed process
|
||||
├─ get_managed_process(session_id, pid) 查询进程状态(pid 默认 'default')
|
||||
├─ stop_managed_process(session_id, pid) 单独停止某个 managed process
|
||||
├─ get_managed_process_websocket_url(...) 返回 WS attach URL
|
||||
│
|
||||
├─ list_skills() / get_skill(name) Skill 元数据
|
||||
├─ create_skill / update_skill / delete_skill Skill CRUD
|
||||
├─ scan_skill_directory(path) 扫描目录
|
||||
├─ list_skill_files / read_skill_file / write_skill_file
|
||||
├─ preview_skill_zip / install_skill_zip zip / GitHub 安装
|
||||
│
|
||||
├─ shutdown() / dispose() 清理:RPC SHUTDOWN + 进程终止
|
||||
├─ get_status() / get_sessions() / get_recent_errors()
|
||||
└─ get_system_guidance() LLM 系统提示
|
||||
```
|
||||
|
||||
**Profile 系统**: 4 个内置 Profile(`default` / `offline_readonly` / `network_basic` / `network_extended`),`locked` frozenset 字段不可被 LLM 覆盖。参数合并顺序:Profile defaults → LLM 请求参数 → locked 强制值。
|
||||
|
||||
**输出截断**: 默认 4000 字符上限,保留前 60% + 后 40%,中间插入 `[...truncated...]`。
|
||||
|
||||
**Skill 挂载合并**: `execute_tool()` 调用时,`build_skill_extra_mounts(query)` 会把当前 pipeline-bound 的所有 skill 的 `package_root` 作为 `extra_mounts` 加入 BoxSpec,挂在 `/workspace/.skills/<name>`。LLM 通过 `activate` 工具显式激活某个 skill 后,工具调用才允许引用这个 skill 的虚拟路径。
|
||||
|
||||
### 2.2 BoxRuntimeConnector (`pkg/box/connector.py`, 357 行)
|
||||
|
||||
管理与 Box Runtime 的通信连接:
|
||||
|
||||
- **本地 stdio**: Unix/macOS 默认路径,fork `python -m langbot_plugin.cli.__init__ box -s --ws-control-port {port}` 子进程(与 plugin runtime 统一走 `lbp` CLI 入口)
|
||||
- **本地 subprocess + WS**: Windows 本地(asyncio ProactorEventLoop 不支持 stdio pipe)
|
||||
- **远程 WebSocket**: Docker 部署 / `box.runtime.endpoint` 显式配置时,连接 `ws://{host}:{port}/rpc/ws`
|
||||
- **同步等待**: `asyncio.Event` + `wait_for(timeout=30s)` 模式确认连接
|
||||
- **心跳**: `_heartbeat_loop()` 每 20s 调用 `ping()`,失败仅 DEBUG 日志(断开检测靠 connection close)
|
||||
- **重连**: `runtime_disconnect_callback` 由 BoxService 提供,触发 `_reconnect_loop`
|
||||
- **INIT 注入**: 连接建立后立即下发当前 `box.*` 配置子树(剔除 `runtime` 私有字段),Runtime 据此初始化 backend
|
||||
|
||||
> **历史改进**: 2026-04-16 版本本文档曾列 P0 「Box 无心跳 / 无重连」,已修复(commit `2dfd9d5d`、`c6882cf`、`5029d9c` 等)。
|
||||
|
||||
### 2.3 BoxWorkspaceSession 工具 (`pkg/box/workspace.py`, 413 行)
|
||||
|
||||
此文件目前提供两类能力:
|
||||
|
||||
1. **路径与命令重写工具函数** — `normalize_host_path` / `rewrite_mounted_path` / `unwrap_venv_path` / `rewrite_venv_command` / `infer_workspace_host_path`,被 MCP loader 与 Skill 路径解析共用。
|
||||
2. **`BoxWorkspaceSession`** — 围绕 BoxService 的轻量包装,专供 MCP-in-Box 场景使用(管理一个共享 session 的 session_id、构建挂载 payload、stage host 文件到共享 workspace)。
|
||||
|
||||
**变化点**: 早期 Skill exec 会为每个 skill 创建独立 BoxWorkspaceSession(独占 session);当前实现已转为 `extra_mounts` 模式,Skill 不再独占容器,只追加挂载。这部分 wrapping 逻辑已从 native loader 移除。
|
||||
|
||||
### 2.4 policy.py (`pkg/box/policy.py`, 98 行) — 仍是死代码
|
||||
|
||||
三层安全策略设计(`SandboxPolicy` / `ToolPolicy` / `ElevatedPolicy`),全项目无任何导入或调用。详见 [SaaS 阻塞项 S2](./box-issues.md)。
|
||||
|
||||
### 2.5 SkillManager (`pkg/skill/manager.py`, 186 行)
|
||||
|
||||
```
|
||||
SkillManager
|
||||
├─ initialize() 调用 reload_skills()
|
||||
├─ reload_skills() 先从 Box runtime list_skills(),
|
||||
│ 不可用则回落 data/skills/ 扫描
|
||||
├─ refresh_skill_from_disk() 单 skill 重新加载
|
||||
├─ get_skill_by_name(name)
|
||||
└─ get_managed_skills_root() 返回 Box 视角的 skills_root 路径
|
||||
```
|
||||
|
||||
skill 元数据通过 `parse_frontmatter` 解析 `SKILL.md` 头部(`name` / `description` / `instructions`),不再做整体扫描的代价(典型 < 50 个)。
|
||||
|
||||
### 2.6 Skill activation (`pkg/skill/activation.py`, 33 行) + Skill loader 辅助
|
||||
|
||||
历史上 skill 通过 LLM 在文本中输出 `[ACTIVATE_SKILL:name]` 标记激活;当前已改为 **Tool Call 机制**:
|
||||
|
||||
- `SkillToolLoader` (`pkg/provider/tools/loaders/skill.py`, 157 行) 暴露 `activate` 工具,参数为 skill 名
|
||||
- 工具实现调用 `register_activated_skill(query, skill_data)`,将激活态写入 `query.variables['_activated_skills']`
|
||||
- 这种 KV-cache-friendly 模式对齐 Claude Code 设计;详见 [box-session-scope.md §4.3](./box-session-scope.md) 的 Tool Call 描述
|
||||
|
||||
`activation.py` 现仅保留对外辅助函数(pipeline 层调用 loader 的 `register_activated_skill`)。
|
||||
|
||||
---
|
||||
|
||||
## 3. SDK 侧模块
|
||||
|
||||
### 3.1 BoxRuntime (`box/runtime.py`, 599 行)
|
||||
|
||||
核心编排器,管理 session 生命周期与 backend 调度:
|
||||
|
||||
```
|
||||
Session 生命周期:
|
||||
|
||||
Client EXEC / CREATE_SESSION
|
||||
│
|
||||
▼
|
||||
_get_or_create_session(spec)
|
||||
├─ _reap_expired_sessions_locked() 清理 TTL 过期 session
|
||||
├─ 已存在? → _assert_session_compatible() → 复用
|
||||
├─ Backend session 失踪? → 重建 (commit c6882cf)
|
||||
└─ 新建? → backend.start_session(spec) → 创建容器
|
||||
│ └─ 应用 spec.extra_mounts (多挂载)
|
||||
▼
|
||||
execute(spec)
|
||||
├─ 获取 session lock (每 session 独立)
|
||||
├─ backend.exec(session, spec) 在容器中执行命令
|
||||
├─ 更新 last_used_at
|
||||
└─ 超时? → 销毁 session
|
||||
│
|
||||
▼
|
||||
Session 保持存活直到:
|
||||
├─ TTL 过期 (默认 300s,下次操作时清理)
|
||||
├─ 执行超时 (自动销毁)
|
||||
├─ 客户端 DELETE_SESSION
|
||||
└─ SHUTDOWN
|
||||
```
|
||||
|
||||
**关键设计**:
|
||||
- 每 session 有独立 `asyncio.Lock`,同一 session 内的命令串行执行
|
||||
- 每 session 维护 `managed_processes: dict[process_id, _ManagedProcess]`,支持多个长驻进程并存(MCP / 自定义)
|
||||
- 全局 `_lock` 保护 `_sessions` dict 的读写
|
||||
- 兼容性检查:比较核心 spec 字段,`image` 字段对不支持自定义镜像的 backend(nsjail/E2B)会跳过
|
||||
|
||||
**Backend 选择 (`_select_backend`)**: 优先级
|
||||
1. 显式 `box.backend` 配置(`docker` / `nsjail` / `e2b`)
|
||||
2. `local` (默认) → Docker / Podman / nsjail CLI 顺序探测
|
||||
3. `get_status` 调用时若当前 backend 不可用,会尝试重新选择 (commit `e5617c7`)
|
||||
|
||||
### 3.2 Backend 系统
|
||||
|
||||
#### CLISandboxBackend (`box/backend.py`, 411 行)
|
||||
|
||||
Docker / Podman 公共基类:
|
||||
|
||||
```
|
||||
start_session(spec):
|
||||
1. validate_sandbox_security(spec)
|
||||
2. docker/podman run -d --rm --name <name>
|
||||
--network none (可选)
|
||||
--cpus/--memory/--pids-limit
|
||||
--read-only + --tmpfs /tmp
|
||||
-v <host>:<mount>:<mode> 主挂载
|
||||
-v <extra.host>:<extra.mount>:.. 额外挂载 (extra_mounts)
|
||||
<image> sh -lc 'while true; do sleep 3600; done'
|
||||
3. 返回 BoxSessionInfo
|
||||
|
||||
exec(session, spec):
|
||||
docker/podman exec -e KEY=VAL <container>
|
||||
sh -lc 'mkdir -p <workdir> && cd <workdir> && <cmd>'
|
||||
|
||||
start_managed_process(session, spec):
|
||||
docker/podman exec -i <container>
|
||||
sh -lc 'mkdir -p <cwd> && cd <cwd> && exec <command> <args>'
|
||||
返回 asyncio.subprocess.Process (stdin/stdout PIPE)
|
||||
```
|
||||
|
||||
容器以 idle 进程启动,实际命令通过 `docker exec` 执行。`--rm` 确保容器退出时自动清理。
|
||||
|
||||
**Windows 支持**: backend 内对 Windows 路径处理与 subprocess 调用做了适配(commit `120817a`)。
|
||||
|
||||
**孤儿清理**: 启动时枚举 `langbot.box=true` 标签的容器,instance_id 不匹配的强制删除。
|
||||
|
||||
#### NsjailBackend (`box/nsjail_backend.py`, 552 行)
|
||||
|
||||
轻量级 Linux 沙箱(无容器引擎依赖):
|
||||
|
||||
- 使用 namespace 隔离(user/mount/pid/ipc/uts/cgroup/net)
|
||||
- 挂载宿主 `/usr`/`/lib`/`/bin`/`/sbin` 只读 + 选定 `/etc` 条目
|
||||
- 每 session 创建独立目录(workspace/tmp/home)
|
||||
- 资源限制: cgroup v2 优先,fallback 到 rlimit
|
||||
- **CLI 兼容**: 通过 `shutil.which(self._nsjail_bin)` 检测系统安装版 nsjail;不存在时再尝试容器内 nsjail(commit `686fcc0`、`feed530`)
|
||||
- **无自定义镜像**: 使用宿主 OS,`image` 字段固定为 `'host'`,兼容性检查跳过 image
|
||||
|
||||
#### E2BBackend (`box/e2b_backend.py`, 429 行)
|
||||
|
||||
云沙箱后端(commit `75b547f` 引入):
|
||||
|
||||
- 通过 `e2b` SDK 与 E2B 平台通信
|
||||
- 配置:`box.e2b.api_key` / `api_url` / `template`
|
||||
- 支持 `extra_mounts`(commit `0fea9b1` 同步上传文件)
|
||||
- 无本地容器引擎依赖,适合无 Docker 的部署或 SaaS 多租户场景
|
||||
- 不支持自定义 image 字段,由 template 控制
|
||||
|
||||
### 3.3 Server (`box/server.py`, 508 行)
|
||||
|
||||
单端口 aiohttp 服务(默认 5410),通过路径区分(commit `8c71ec5` 合并端口):
|
||||
|
||||
1. **Action RPC** (`/rpc/ws`): `BoxServerHandler` 处理所有 action,包括 `INIT` 配置注入、skill store 操作等
|
||||
2. **WS Relay** (`/v1/sessions/{id}/managed-process/ws` 与 `/v1/sessions/{id}/managed-process/{pid}/ws`): 双向桥接 WebSocket ↔ 指定 managed process stdin/stdout
|
||||
|
||||
stdio 模式同样会在 5410 启动 aiohttp,专门承担 managed process attach;Action RPC 走 stdin/stdout。
|
||||
|
||||
### 3.4 Client (`box/client.py`, 377 行)
|
||||
|
||||
`ActionRPCBoxClient` 封装 `Handler.call_action()` 调用:
|
||||
|
||||
- 25+ 方法对应 25+ 个 RPC action(exec / session / managed-process / skill / status / shutdown)
|
||||
- 错误还原: `_translate_action_error()` 通过字符串前缀匹配还原 SDK 侧异常类型
|
||||
- `execute()` timeout = 300s,其他默认 15s
|
||||
- `BoxRuntimeClient` 是 ABC,供后续可能的非 RPC 实现复用
|
||||
|
||||
包级别 `__init__.py` 显式导出:`BoxRuntimeClient`、`ActionRPCBoxClient`(commit `df9c722`)。
|
||||
|
||||
### 3.5 Actions (`box/actions.py`, 34 行)
|
||||
|
||||
`LangBotToBoxAction` 枚举共定义 **25 个** action:
|
||||
|
||||
| 类别 | Actions |
|
||||
|------|---------|
|
||||
| 控制 | `INIT`、`HEALTH`、`STATUS`、`GET_BACKEND_INFO`、`SHUTDOWN` |
|
||||
| 执行 | `EXEC` |
|
||||
| Session | `CREATE_SESSION` / `GET_SESSION` / `GET_SESSIONS` / `DELETE_SESSION` |
|
||||
| Managed Process | `START_MANAGED_PROCESS` / `GET_MANAGED_PROCESS` / `STOP_MANAGED_PROCESS` |
|
||||
| Skill | `LIST_SKILLS` / `GET_SKILL` / `CREATE_SKILL` / `UPDATE_SKILL` / `DELETE_SKILL` / `SCAN_SKILL_DIRECTORY` / `LIST_SKILL_FILES` / `READ_SKILL_FILE` / `WRITE_SKILL_FILE` / `PREVIEW_SKILL_ZIP` / `INSTALL_SKILL_ZIP` |
|
||||
|
||||
### 3.6 Models (`box/models.py`, 331 行)
|
||||
|
||||
核心数据模型:
|
||||
|
||||
| 模型 | 用途 |
|
||||
|------|------|
|
||||
| `BoxNetworkMode` | `OFF` / `ON` |
|
||||
| `BoxExecutionStatus` | `COMPLETED` / `TIMED_OUT` |
|
||||
| `BoxHostMountMode` | `NONE` / `READ_ONLY` / `READ_WRITE` |
|
||||
| `BoxManagedProcessStatus` | `RUNNING` / `EXITED` |
|
||||
| `BoxMountSpec` | 单条挂载(host_path/mount_path/mode)— **新增** |
|
||||
| `BoxSpec` | 执行请求;新增 `extra_mounts: list[BoxMountSpec]`、`persistent`、`workspace_quota_mb` |
|
||||
| `BoxProfile` | 4 个内置 Profile + `locked` frozenset |
|
||||
| `BoxSessionInfo` | Session 状态(含 backend_name/created_at/last_used_at) |
|
||||
| `BoxManagedProcessSpec` | 长驻进程参数(process_id/command/args/env/cwd) |
|
||||
| `BoxManagedProcessInfo` | 进程状态(status/exit_code/stderr_preview/attached) |
|
||||
| `BoxExecutionResult` | 执行结果(status/exit_code/stdout/stderr/duration_ms) |
|
||||
|
||||
`BoxSpec` 校验器: `workdir` 默认继承 `mount_path`;`host_path` 支持 POSIX 和 Windows 路径;设置 `host_path` 时 `workdir` 必须在 `mount_path` 下。
|
||||
|
||||
### 3.7 BoxSkillStore (`box/skill_store.py`, 647 行)
|
||||
|
||||
新增模块(commit `4ab3502`),把 skill 持久化收归 Box runtime:
|
||||
|
||||
```
|
||||
BoxSkillStore
|
||||
├─ list_skills() / get_skill(name)
|
||||
├─ create_skill(data) / update_skill(name, data) / delete_skill(name)
|
||||
├─ scan_skill_directory(path) 扫描目录返回候选 skill 包列表
|
||||
├─ list_skill_files(name, path) 浏览 skill 内文件树
|
||||
├─ read_skill_file(name, path) / write_skill_file(name, path, content)
|
||||
├─ preview_skill_zip(zip_bytes, ...) 不落盘预览 zip 内容
|
||||
└─ install_skill_zip(zip_bytes, ...) 解压、校验、复制到 skills_root
|
||||
└─ 支持 source_subdir / target_suffix(commit 1aa043f)
|
||||
```
|
||||
|
||||
GitHub 安装路径:HTTP 层(`api/http/service/skill.py`)先 `git clone` 拉取,再走 `install_skill_zip` 或 directory 路径。Skill 文件存放于 `box.local.skills_root`(默认 `skills`,相对 `host_root`),容器内对应 `/workspace/.skills/`。
|
||||
|
||||
### 3.8 Security (`box/security.py`, 52 行)
|
||||
|
||||
`validate_sandbox_security()`: 黑名单校验 host_path,阻止挂载 `/etc`/`/proc`/`/sys`/`/dev`/`/root`/`/boot` 及 Docker/Podman socket。
|
||||
|
||||
**已知缺陷**: 根路径 `/` 未拦截,用户 home 目录未拦截,是 denylist 而非 allowlist 策略。详见 [SaaS 阻塞项 S5](./box-issues.md)。
|
||||
|
||||
### 3.9 Errors (`box/errors.py`, 33 行)
|
||||
|
||||
| 异常类型 | 含义 |
|
||||
|----------|------|
|
||||
| `BoxError` | 基类 |
|
||||
| `BoxValidationError` | spec/参数校验失败 |
|
||||
| `BoxBackendUnavailableError` | 无可用 backend |
|
||||
| `BoxRuntimeUnavailableError` | Runtime 服务不可用 |
|
||||
| `BoxSessionConflictError` | session 已存在但 spec 不兼容 |
|
||||
| `BoxSessionNotFoundError` | session 不存在 |
|
||||
| `BoxManagedProcessConflictError` | session 已有同名 process |
|
||||
| `BoxManagedProcessNotFoundError` | process 不存在 |
|
||||
|
||||
---
|
||||
|
||||
## 4. 工具系统集成
|
||||
|
||||
### 4.1 ToolManager 编排 (`toolmgr.py`)
|
||||
|
||||
```
|
||||
ToolManager.initialize()
|
||||
├─ NativeToolLoader (exec / read / write / edit / glob / grep)
|
||||
├─ PluginToolLoader (插件工具)
|
||||
├─ MCPLoader (MCP Server 工具)
|
||||
├─ SkillToolLoader (activate 工具 — Tool Call 激活)
|
||||
└─ SkillAuthoringToolLoader (Skill CRUD)
|
||||
|
||||
工具调用优先级: native → plugin → mcp → skill → skill_authoring
|
||||
```
|
||||
|
||||
### 4.2 Native Tools (`native.py`, 846 行)
|
||||
|
||||
| 工具 | 是否在 Box 中执行 | 是否访问宿主文件系统 |
|
||||
|------|:---:|:---:|
|
||||
| `exec` | 是 | 否 |
|
||||
| `read` | **否** | **是** — 直接 `open()` 宿主文件 |
|
||||
| `write` | **否** | **是** — 直接 `open()` 宿主文件 |
|
||||
| `edit` | **否** | **是** — 直接 `open()` 宿主文件 |
|
||||
| `glob` | **否** | **是** — 直接遍历宿主目录 |
|
||||
| `grep` | **否** | **是** — 直接读宿主文件 |
|
||||
|
||||
**沙箱边界不对称**: 这是刻意的设计权衡 — `read`/`write`/`edit`/`glob`/`grep` 绕过沙箱以获得性能(避免容器 I/O 开销与跨进程拷贝),但意味着 LLM 可以直接读写 `allowed_mount_roots` 下任何文件。Skill 路径经 `_resolve_host_path()` 重写,禁止穿越 `package_root`。
|
||||
|
||||
**exec 的 Skill 分支**: 命令中引用 `/workspace/.skills/<name>` 的 skill 时:
|
||||
1. 验证 skill 已激活
|
||||
2. 单次 exec 只能引用一个 skill 包
|
||||
3. 若 skill 是 Python 项目(有 `requirements.txt` 或 `pyproject.toml`),命令会被 venv bootstrap 包裹(在 skill 挂载点内创建 `.venv`)
|
||||
4. 调用 `box_service.execute_tool()` → 走默认 session_id 与已组装好的 `extra_mounts`,**不再为每 skill 起独立 session**
|
||||
|
||||
### 4.3 MCP-in-Box (`mcp_stdio.py`, 354 行)
|
||||
|
||||
`BoxStdioSessionRuntime` 让 MCP stdio 服务器在 Box 容器中运行,**共享 session、多 process**模式(commit `529088e`):
|
||||
|
||||
```
|
||||
initialize()
|
||||
1. 复用/创建共享 session (session_id = _build_box_session_id())
|
||||
- persistent=True,长期保持
|
||||
2. workspace.execute_raw(install_cmd) 安装依赖 (可选)
|
||||
3. 将每个 MCP server 文件 stage 到 /workspace/.mcp/<process_id>/
|
||||
4. workspace.start_managed_process(process_id=<server>)
|
||||
5. websocket_client(ws_url) 通过 WS relay 连接
|
||||
6. ClientSession.initialize() MCP 协议握手
|
||||
```
|
||||
|
||||
配置 (`MCPServerBoxConfig`): `network='on'` (MCP 服务器通常需要网络),`host_path_mode='ro'` (默认只读),`startup_timeout_sec=120` (留时间给 pip install)。
|
||||
|
||||
每条 MCP server 是同一 session 中的一个 managed process,独立的 `process_id`、独立 attach URL,互不阻塞。
|
||||
|
||||
---
|
||||
|
||||
## 5. 启动与生命周期
|
||||
|
||||
### 5.1 启动顺序 (`build_app.py`)
|
||||
|
||||
```
|
||||
BuildAppStage.run(ap)
|
||||
├─ ... (persistence, models, sessions) ...
|
||||
│
|
||||
├─ BoxService(ap)
|
||||
├─ box_service.initialize()
|
||||
│ └─ connector.initialize()
|
||||
│ ├─ [stdio] fork box subprocess
|
||||
│ ├─ [subprocess+WS] Windows 本地
|
||||
│ └─ [remote WS] connect URL
|
||||
│ └─ 启动心跳 _heartbeat_task
|
||||
├─ ap.box_service = box_service
|
||||
│
|
||||
├─ ToolManager(ap)
|
||||
├─ tool_mgr.initialize()
|
||||
│ ├─ NativeToolLoader (检查 box_service.available)
|
||||
│ ├─ PluginToolLoader
|
||||
│ ├─ MCPLoader (Box 可用时,stdio MCP 走沙箱)
|
||||
│ └─ SkillAuthoringToolLoader
|
||||
├─ ap.tool_mgr = tool_mgr
|
||||
│
|
||||
├─ ... (platform, pipeline) ...
|
||||
├─ SkillManager.initialize() (从 Box runtime 加载 skill 列表)
|
||||
└─ ... (RAG, HTTP, plugins) ...
|
||||
```
|
||||
|
||||
BoxService 在 ToolManager **之前**初始化。ToolManager 创建 loader 时检查 `box_service.available`。
|
||||
|
||||
### 5.2 初始化失败处理
|
||||
|
||||
```python
|
||||
try:
|
||||
await self._runtime_connector.initialize()
|
||||
self._available = True
|
||||
except Exception as e:
|
||||
self._available = False
|
||||
logger.warning(f"Box runtime unavailable: {e}")
|
||||
```
|
||||
|
||||
**静默降级**: Box 初始化失败不会阻止应用启动,仅导致 6 个 native tool、所有 Skill 工具和 MCP-in-Box 工具不暴露给 LLM。与 Plugin 的行为不同(Plugin 失败会抛异常)。
|
||||
|
||||
### 5.3 销毁流程
|
||||
|
||||
```
|
||||
app.dispose()
|
||||
└─ box_service.dispose()
|
||||
├─ connector.dispose()
|
||||
│ ├─ cancel _heartbeat_task
|
||||
│ ├─ cancel _handler_task / _ctrl_task
|
||||
│ └─ terminate subprocess (SIGTERM)
|
||||
└─ loop.create_task(client.shutdown())
|
||||
└─ RPC SHUTDOWN → Box Runtime 清理所有容器
|
||||
```
|
||||
|
||||
Box 额外做了 RPC SHUTDOWN 通知 Runtime 主动清理容器,比 Plugin 的直接杀进程更安全。
|
||||
|
||||
---
|
||||
|
||||
## 6. 配置
|
||||
|
||||
### config.yaml (重构后)
|
||||
|
||||
```yaml
|
||||
box:
|
||||
enabled: true # 整个 Box 子系统的总开关。设为 false 时:
|
||||
# - 不连接远程 Box runtime,不 fork 本地 stdio 子进程
|
||||
# - sandbox 工具 (exec/read/write/edit/glob/grep) 不暴露给 LLM
|
||||
# - skill 添加/编辑 / GitHub 安装 / 文件写入全部拒绝
|
||||
# - stdio 模式的 MCP server 启动时报错(http/sse 模式不受影响)
|
||||
# - skill 列表/读取保持只读可用
|
||||
# BOX__ENABLED 环境变量可覆盖(统一约定)
|
||||
backend: 'local' # 'local' (探测) / 'docker' / 'nsjail' / 'e2b'
|
||||
# 由 box.backend / BOX__BACKEND 选择后端
|
||||
runtime:
|
||||
endpoint: '' # 外部 Runtime 的 WS 基地址 'ws://host:5410'
|
||||
# 留空 = 本地自管 Runtime
|
||||
local:
|
||||
profile: 'default'
|
||||
image: '' # 覆盖 profile 默认 image
|
||||
host_root: './data/box' # 工作区挂载根,Docker 部署需绝对路径
|
||||
default_workspace: '' # 默认 '<host_root>/default'
|
||||
skills_root: 'skills' # Box 管理的 skill 包目录(相对 host_root)
|
||||
allowed_mount_roots: # 默认 ['<host_root>']
|
||||
- './data/box'
|
||||
- '/tmp'
|
||||
workspace_quota_mb: null # 配额覆盖,null = 走 profile
|
||||
e2b:
|
||||
api_key: '' # 也可走 E2B_API_KEY 环境变量
|
||||
api_url: '' # 自托管 E2B 时填写
|
||||
template: '' # 默认 template ID
|
||||
```
|
||||
|
||||
> **重大变更**: 较 2026-04-16 文档,配置结构完全重组(commit `eefdea4`)。原字段 `box.profile` / `box.runtime_url` / `box.shared_host_root` / `box.allowed_host_mount_roots` 全部迁入 `box.local.*` 子表,新增 `box.backend` 与 `box.e2b.*` 配置组。
|
||||
|
||||
### docker-compose.yaml
|
||||
|
||||
`langbot_box` 服务受 compose profile 控制,默认 `docker compose up` **不会**启动它。需要 sandbox 时:
|
||||
|
||||
```bash
|
||||
docker compose --profile box up # 启动 langbot + langbot_box + plugin runtime
|
||||
docker compose --profile all up # 同上
|
||||
docker compose up # 只起 langbot + plugin runtime (box 关闭)
|
||||
```
|
||||
|
||||
若不起 `langbot_box`,需要同步在 `data/config.yaml` 中设 `box.enabled: false`(或 langbot 容器 env 加 `BOX__ENABLED=false`),否则 LangBot 会一直尝试连接不存在的 Box runtime 并报错。
|
||||
|
||||
```yaml
|
||||
# langbot_box 的关键 volume
|
||||
volumes:
|
||||
- ${LANGBOT_BOX_ROOT}:${LANGBOT_BOX_ROOT} # 工作区挂载(源/目标同路径)
|
||||
- /var/run/docker.sock:/var/run/docker.sock # Docker backend 复用宿主 docker
|
||||
```
|
||||
|
||||
### 关闭/连接失败时的行为矩阵
|
||||
|
||||
`box.enabled = false` 与"启用但连接失败"在用户可观察行为上**完全一致**——都通过 `BoxService.available = False` 表达,只是 `get_status` 多返回 `enabled` 字段供前端区分文案。
|
||||
|
||||
| 消费方 | Box 可用 | Box 不可用(disabled 或 failed) |
|
||||
|---|---|---|
|
||||
| native exec/read/write/edit/glob/grep 工具 | 暴露给 LLM | **不暴露** |
|
||||
| `activate` / `register_skill` 工具 | 暴露给 LLM | **不暴露** |
|
||||
| stdio MCP server | 在 Box 内启动 | **`_init_stdio_python_server` 抛 RuntimeError** 拒绝;不退化到宿主 stdio |
|
||||
| http/sse MCP server | 正常 | 正常(不依赖 Box) |
|
||||
| Skill 列表/读取 (`list_skills`/`get_skill`/`read_skill_file`) | 走 Box runtime | 走 LangBot 本地 `data/skills/` 只读 fallback |
|
||||
| Skill 创建/编辑/安装/写文件 | 走 Box runtime | **HTTP 400** + 明确错误信息(`_require_box_for_write`) |
|
||||
| Pipeline AI 配置中 `box-session-id-template` | 正常生效 | **前端 banner** 提示字段无效 |
|
||||
| Pipeline 扩展页 `enable_all_skills` / 绑定 skill | 可编辑 | **前端禁用** + banner |
|
||||
| 仪表盘 Box 状态卡片 | 绿点 / "已连接" | 灰点 / "已禁用"(disabled) 或 红点 / "已断开"(failed) |
|
||||
|
||||
> 后端拒写的边界条件:如果 `ap.box_service` **完全没装**(老式 dev mode,没经过 BuildAppStage),`_require_box_for_write` 视作 no-op,保留 `data/skills/` 本地路径——以兼容历史测试与最小化设置。生产环境总会装 `ap.box_service`,因此该 fallback 不会被触发。
|
||||
|
||||
### Pipeline 配置 (templates/metadata/pipeline/ai.yaml)
|
||||
|
||||
`local-agent.config.box-session-id-template` 控制 session 作用域,预设:
|
||||
|
||||
- `{launcher_type}_{launcher_id}` — 每个会话 (推荐,默认)
|
||||
- `{launcher_type}_{launcher_id}_{sender_id}` — 群聊每个用户
|
||||
- `{launcher_type}_{launcher_id}_{conversation_id}` — 每个对话上下文
|
||||
- `{query_id}` — 每条消息(完全隔离)
|
||||
|
||||
详见 [box-session-scope.md](./box-session-scope.md)。
|
||||
|
||||
### REST API
|
||||
|
||||
| 端点 | 方法 | 说明 | 前端 |
|
||||
|------|------|------|:---:|
|
||||
| `/api/v1/box/status` | GET | 可用性、Profile、后端信息 | ✅ 监控页 |
|
||||
| `/api/v1/box/sessions` | GET | 活跃 session 列表 | ❌ |
|
||||
| `/api/v1/box/errors` | GET | 最近 50 条错误 | ❌ |
|
||||
| `/api/v1/skills` 等 | GET/POST/PUT/DELETE | Skill CRUD、文件浏览、zip/GitHub 安装、preview | ✅ Skill 管理页 |
|
||||
|
||||
前端 `web/src/app/home/monitoring/components/overview-cards/SystemStatusCards.tsx` 已接入 `/api/v1/box/status`,展示 backend 名称、profile 与活跃 session 数。Sessions 与 errors API 仍未接入。
|
||||
76
docs/review/box-issues.md
Normal file
76
docs/review/box-issues.md
Normal file
@@ -0,0 +1,76 @@
|
||||
# Box 系统 — SaaS 发布前阻塞项
|
||||
|
||||
> 更新日期: 2026-06-02
|
||||
> 分支: `feat/sandbox` (LangBot + langbot-plugin-sdk)
|
||||
> 相关文档: [架构分析](./box-architecture.md) | [Session 作用域](./box-session-scope.md) | [Runtime 对比](./box-vs-plugin-runtime.md) | [测试覆盖](./box-test-coverage.md) | [toB 分析](./box-tob-analysis.md)
|
||||
|
||||
## 范围说明
|
||||
|
||||
**自部署社区版已具备发布条件**:默认 stdio 模式、box 为可选项;box 关闭 / 不可用时后端、前端、工具、skill、stdio-MCP 均能干净降级(清晰报错、不崩溃);配置向后兼容(旧 `data/config.yaml` 可直接启动);无新增 ORM 模型、无迁移欠债;市场安装失败不会破坏实例。CI 全绿。
|
||||
|
||||
本清单**只保留发布 SaaS / 多租户 / 公网暴露前必须处理的阻塞项**。社区版(可信、单运营者、内网)不受这些项阻塞——它们的风险面在"不可信调用方能直接触达 Box 控制面"或"多租户共享资源"的场景才成立。
|
||||
|
||||
## 已解决(社区版发布前)
|
||||
|
||||
| 项 | 处理 |
|
||||
|----|------|
|
||||
| 工具调用循环无上限 (原 #13) | `localagent.py` 增加 `MAX_TOOL_CALL_ROUNDS=128`,超限优雅终止(`cafef1a3`) |
|
||||
| 配额校验同步遍历阻塞事件循环 (原 #10) | `_enforce_workspace_quota` 改 async,工作区遍历走 `asyncio.to_thread`(`cafef1a3`) |
|
||||
| `host_path` 挂载白名单 (原 #3 的 LangBot 侧) | `pkg/box/service.py` `allowed_mount_roots` 白名单,空列表时拒绝一切宿主挂载 |
|
||||
| 重复的 `_is_path_under` (原 #12) | 已去重,仅保留一处定义 |
|
||||
| 重连 / 心跳 / Windows 兼容 / nsjail image 字段 / 前端 Box 状态接入 | 见上一轮 review 记录,均已合入 |
|
||||
|
||||
---
|
||||
|
||||
## SaaS 阻塞项
|
||||
|
||||
### S1. Box 控制面无认证 — Critical
|
||||
|
||||
- **位置**: SDK `box/server.py` — Action RPC WS (`/rpc/ws`) 与 managed-process relay (`/v1/sessions/{id}/managed-process/{pid}/ws`)
|
||||
- **现状**: 两个 WS handler 在 `ws.prepare` 后直接服务,无任何 token / 鉴权;box 默认绑定 `0.0.0.0:5410`。任何能触达该端口者可发起 `EXEC`、创建 session、attach 任意 session 的 managed-process stdin/stdout、甚至 `SHUTDOWN`。LangBot→box 的 INIT 也未下发任何凭证。
|
||||
- **缓解现状**: 默认 `docker-compose.yaml` 的 `langbot_box` 未把 5410 发布到宿主(爆炸半径限于内网 bridge);但 box 挂载了 `/var/run/docker.sock`,同网络的任意服务(含被攻破的插件)→ 宿主 root。若运营者把 5410 发布到宿主或独立以 `0.0.0.0` 起 box,则完全裸奔。
|
||||
- **要求**: INIT 时下发 token,两个 WS 路由按连接校验(query/header)。这是 SaaS 的**头号**阻塞项。
|
||||
|
||||
### S2. 无 exec 授权模型(policy.py 死代码) — High
|
||||
|
||||
- **位置**: LangBot `pkg/box/policy.py`(`SandboxPolicy` / `ToolPolicy` / `ElevatedPolicy` 全项目无引用);`pkg/provider/tools/loaders/native.py`;`pkg/provider/tools/toolmgr.py`
|
||||
- **现状**: 原生工具(`exec/read/write/edit/glob/grep`)按"box 是否可用"全有或全无地暴露,**无 per-pipeline 的 exec 网关 / 工具白名单 / 沙箱模式 / 权限提升控制**。只要 box 可用,任何使用 local-agent + 函数调用模型的 pipeline 都能跑任意 shell。
|
||||
- **要求**: 接入 policy.py(或等价机制),按 pipeline 控制是否暴露 `exec`、可用工具白名单、沙箱网络/只读模式。
|
||||
|
||||
### S3. 会话资源无界(DoS) — High
|
||||
|
||||
- **#5 session 数量无上限**: SDK `box/runtime.py` `_get_or_create_session` 的 `_sessions` dict 无容量限制——可变 `session_id` 的恶意调用可无限创建容器,耗尽宿主 CPU/内存/PID/磁盘。
|
||||
- **#8 无定时回收**: 过期 session 仅在 `_get_or_create_session` 时机会性清理,无独立周期任务;一波创建后转静默会永久泄漏容器。
|
||||
- **要求**: `max_sessions` 上限(拒绝或 LRU),加独立周期 reaper(如 60s)。
|
||||
|
||||
### S4. 工作区配额无内核级限制(TOCTOU) — Med-High
|
||||
|
||||
- **位置**: LangBot `pkg/box/service.py` `_enforce_workspace_quota`(应用层 read-then-check);SDK 侧 `workspace_quota_mb` 仅记录/透传,无 `--storage-opt size=` 等内核/FS 限额
|
||||
- **现状**: 执行前后两次检查之间存在竞态窗口;单条命令(`dd`/`fallocate`)可在检查间隙撑爆磁盘,事后检查只能补救。
|
||||
- **要求**: Docker `--storage-opt size=` 做内核级限制,或 Redis 原子计数预留式配额。
|
||||
|
||||
### S5. 挂载校验缺口 — Med-High
|
||||
|
||||
- **位置**: SDK `box/security.py` `_BLOCKED_HOST_PATHS_POSIX`;`box/backend.py` 的 `extra_mounts` 处理
|
||||
- **现状**: ① SDK 黑名单仍不含 `/`(前缀匹配,`host_path="/"` 可通过,挂载整个宿主 fs);用户 home、`/usr`、`/opt`、`/tmp` 也未拦截。② `validate_sandbox_security` 只校验 `spec.host_path`,**从不遍历 `spec.extra_mounts`**——LangBot 侧 `allowed_mount_roots` 也只校验 `host_path`。当前 `extra_mounts` 仅由 `build_skill_extra_mounts` 内部填充(agent 不可达),但缺乏纵深防御:一旦 S1 的无认证 RPC 被触达,extra_mounts 可挂任意宿主路径,两层都不拦。
|
||||
- **要求**: SDK 黑名单加入 `/`(或改白名单);`extra_mounts` 在 SDK 与 LangBot 两侧都纳入挂载校验。
|
||||
|
||||
### S6. 容器加固缺失 — Med
|
||||
|
||||
- **位置**: SDK `box/backend.py` 的 `docker run` 组装
|
||||
- **现状**: 未设置 `--cap-drop=ALL`、`--security-opt=no-new-privileges`、非 root `--user`;叠加挂载 docker.sock,逃逸面偏大。
|
||||
- **要求**: 默认加上上述加固 flag(需回归常用 skill 不被破坏)。
|
||||
|
||||
### S7. 全局锁内执行慢操作(扩展性) — Med
|
||||
|
||||
- **位置**: SDK `box/runtime.py` `_get_or_create_session`:`self._lock` 持有期间调用 `backend.start_session()`(`docker run` / nsjail 启动 / E2B `Sandbox.create`)
|
||||
- **影响**: 冷启动(镜像拉取数秒、E2B >1s)期间串行阻塞所有并发请求——多租户负载下整个 Box runtime 停顿。降级表现是延迟而非失败。
|
||||
- **要求**: 锁内只做状态检查与注册,容器创建移到锁外。
|
||||
|
||||
### S8. 其他硬化 / 跟进 — Low
|
||||
|
||||
- **#9** SDK `box/server.py` 直接读 `runtime._sessions` 私有字段、绕过锁,并发下可能读到不一致状态——应加公共访问方法。
|
||||
- **#16** `pkg/provider/tools/toolmgr.py` `execute_func_call` 按优先级分发,plugin/MCP 若有同名 `exec/read/write/...` 工具会被静默遮蔽——应加命名空间或冲突告警。
|
||||
- **#4** SDK `box/runtime.py` INIT/handshake 与 backend 实例化的残留竞态(仅"纯远程 WS box 先启动、LangBot 后连"场景成立;stdio/compose 路径下 config 经 env 在 spawn 时已就位,无竞态)——应在 INIT 完成前拒绝业务 action。
|
||||
- **#11** `extra_mounts` 在容器创建时固定(SDK `runtime.py` 兼容性检查不含 extra_mounts);长生命周期共享 session 后续新激活的 skill 不会挂上(当前缓解:创建时挂上 pipeline 绑定的全部 skill)——动态绑定场景需销毁重建或文档说明。
|
||||
- **#21** 集成测试未进 CI:容器实际执行、E2B 真机、managed-process WS attach 仅本地可跑。安全关键路径缺自动化覆盖——SaaS 前建议加 Docker-in-Docker CI stage 或合并前手动 checklist。
|
||||
402
docs/review/box-session-scope.md
Normal file
402
docs/review/box-session-scope.md
Normal file
@@ -0,0 +1,402 @@
|
||||
# Box Session Scope Design
|
||||
|
||||
> Date: 2026-04-18 (last reviewed 2026-06-02)
|
||||
> Status (2026-06-02): the self-hosted community edition is release-ready (box optional, clean degradation, no migration debt). Tool-call loop cap, async quota scan, and the host_path mount allowlist have landed. Remaining multi-tenant / security hardening is tracked in [box-issues.md](./box-issues.md).
|
||||
> Branch: `feat/sandbox` (LangBot + langbot-plugin-sdk)
|
||||
> Related: [Box Architecture](./box-architecture.md) | [Box vs Plugin Runtime](./box-vs-plugin-runtime.md)
|
||||
|
||||
---
|
||||
|
||||
## 0. Implementation Status (2026-05-19)
|
||||
|
||||
This document was authored as a design proposal. The current `feat/sandbox` branch
|
||||
has shipped the design largely as written:
|
||||
|
||||
| Item | Status | Notes |
|
||||
|------|--------|-------|
|
||||
| `BoxMountSpec` + `BoxSpec.extra_mounts` | ✅ Shipped | SDK `box/models.py` |
|
||||
| Docker / nsjail / E2B backends apply extra mounts | ✅ Shipped | Last gap closed by SDK commit `0fea9b1` (E2B) |
|
||||
| `box-session-id-template` in `local-agent` pipeline config | ✅ Shipped | `templates/metadata/pipeline/ai.yaml`, default `{launcher_type}_{launcher_id}` |
|
||||
| `BoxService.resolve_box_session_id(query)` | ✅ Shipped | `pkg/box/service.py:166` |
|
||||
| `BoxService.build_skill_extra_mounts(query)` | ✅ Shipped | `pkg/box/service.py:189` |
|
||||
| Skill exec uses unified container + extra mounts | ✅ Shipped | `pkg/provider/tools/loaders/native.py` skill branch |
|
||||
| MCP-in-Box uses shared persistent session, multi-process | ✅ Shipped (earlier than originally scoped) | SDK commit `529088e`, LangBot `mcp_stdio.py:_build_box_session_id` |
|
||||
| `BoxManagedProcessSpec.process_id` + multi-process per session | ✅ Shipped | `BoxRuntime` keeps `managed_processes: dict[pid, _ManagedProcess]` |
|
||||
| Per-tenant / quota integration with templates | ❌ Not started | See [box-tob-analysis.md](./box-tob-analysis.md) |
|
||||
|
||||
The "Phase 2 deferred" note in §10 is **out of date** — MCP unification went in on
|
||||
the same line. Pipeline-scoped (not user-scoped) MCP container is the realized
|
||||
behavior: each pipeline's MCP servers share one `mcp-<pipeline>` session, and
|
||||
user exec sessions use the template-derived id.
|
||||
|
||||
The remaining open work is multi-tenant overlays (tenant_id in session_id,
|
||||
quota counters keyed by tenant), tracked in the toB analysis doc rather than here.
|
||||
|
||||
---
|
||||
|
||||
## 1. Problems
|
||||
|
||||
### 1.1 Default exec: per-message containers
|
||||
|
||||
Currently, `BoxService.execute_tool()` sets `session_id = str(query.query_id)` — an
|
||||
auto-incrementing integer per incoming message. Every user message creates a new sandbox
|
||||
container. Dependencies installed and in-container state are lost between messages.
|
||||
|
||||
### 1.2 Three isolated container pools
|
||||
|
||||
Default exec, skills, and MCP servers each manage their own containers with
|
||||
independent session IDs:
|
||||
|
||||
| Path | Session ID | Container |
|
||||
|--------------|-----------------------------------------------|-------------|
|
||||
| Default exec | `str(query_id)` (per message) | Ephemeral |
|
||||
| Skill exec | `skill-{launcher}_{id}-{skill_name}` | Per skill |
|
||||
| MCP stdio | `mcp-{server_uuid}` | Per server |
|
||||
|
||||
This means a single logical user interaction can spawn 3+ containers that cannot
|
||||
share state, see each other's files, or reuse installed dependencies.
|
||||
|
||||
### 1.3 Single bind mount limitation
|
||||
|
||||
`BoxSpec` currently supports only **one** `host_path` → `mount_path` bind mount.
|
||||
This prevents mounting both a default workspace and skill directories into the
|
||||
same container.
|
||||
|
||||
---
|
||||
|
||||
## 2. Concept Model
|
||||
|
||||
```
|
||||
Platform Message
|
||||
→ Query (query_id: int, auto-increment, per message)
|
||||
→ Session (launcher_type + launcher_id, per chat window)
|
||||
→ Conversation (uuid, per dialogue context within a Session)
|
||||
```
|
||||
|
||||
| Concept | Key | Example | Scope |
|
||||
|---------------|-------------------------------------|----------------------------|------------------------------|
|
||||
| Query | `query_id` | `42` | Single message |
|
||||
| Session | `launcher_type` + `launcher_id` | `group_123456` | Chat window (group or PM) |
|
||||
| Conversation | `conversation_id` (UUID) | `a1b2c3d4-...` | Dialogue context within a Session |
|
||||
| Sender | `sender_id` | `789` | Individual user |
|
||||
|
||||
Note: in a **group chat**, all users share the same Session (keyed by `group_id`). The
|
||||
individual sender is tracked as `sender_id` but does not affect Session/Conversation routing.
|
||||
|
||||
---
|
||||
|
||||
## 3. Target Scenarios
|
||||
|
||||
| # | Scenario | Box Granularity | Desired `session_id` |
|
||||
|----|--------------------------------|------------------------------------------|---------------------------------------------------------|
|
||||
| 1 | Personal assistant | 1 Box per user, long-lived | `{launcher_type}_{launcher_id}` |
|
||||
| 2 | Customer service | 1 Box per customer, cross-pipeline | `{launcher_type}_{launcher_id}` |
|
||||
| 3 | Internal employee tool | 1 Box per employee | `{launcher_type}_{launcher_id}` |
|
||||
| 4 | Group chat shared assistant | 1 Box per group | `{launcher_type}_{launcher_id}` |
|
||||
| 5 | Group chat isolated per user | 1 Box per user within a group | `{launcher_type}_{launcher_id}_{sender_id}` |
|
||||
| 6 | Teaching (cross-channel) | 1 Box per student across groups/PMs | `{sender_id}` |
|
||||
| 7 | One-off execution | 1 Box per message (current behavior) | `{query_id}` |
|
||||
| 8 | Multi-project development | 1 Box per conversation context | `{launcher_type}_{launcher_id}_{conversation_id}` |
|
||||
|
||||
No single fixed granularity covers all scenarios. A template-based approach is needed.
|
||||
|
||||
---
|
||||
|
||||
## 4. Design Overview
|
||||
|
||||
Two key changes:
|
||||
|
||||
1. **Unified container**: exec, skills, and MCP all share the same container per
|
||||
session scope. No more separate container pools.
|
||||
2. **Configurable session scope**: `session_id` is generated from a template with
|
||||
pipeline variables, configurable per pipeline.
|
||||
|
||||
### 4.1 Unified Container with Multiple Mounts
|
||||
|
||||
A single container per session scope is created on first use. It has:
|
||||
|
||||
- **Primary mount**: default workspace at `/workspace` (from `default_host_workspace`)
|
||||
- **Skill mounts**: each pipeline-bound skill's `package_root` mounted at
|
||||
`/workspace/.skills/{skill_name}/`
|
||||
- **MCP servers**: run as managed processes inside the same container
|
||||
|
||||
```
|
||||
Container (session_id = "group_123456")
|
||||
/workspace/ ← default workspace (bind mount, rw)
|
||||
/workspace/.skills/web-search/ ← skill package (bind mount, rw)
|
||||
/workspace/.skills/data-analysis/ ← skill package (bind mount, rw)
|
||||
[managed process: mcp-server-a] ← MCP server running inside
|
||||
[managed process: mcp-server-b] ← MCP server running inside
|
||||
```
|
||||
|
||||
This requires extending `BoxSpec` to support multiple mounts (see §5).
|
||||
|
||||
### 4.2 Session ID Template
|
||||
|
||||
A new field `box-session-id-template` in the `local-agent` pipeline runner config
|
||||
controls the session scope:
|
||||
|
||||
```yaml
|
||||
# templates/metadata/pipeline/ai.yaml (under local-agent.config)
|
||||
- name: box-session-id-template
|
||||
label:
|
||||
en_US: Sandbox Scope
|
||||
zh_Hans: 沙箱作用域
|
||||
description:
|
||||
en_US: >-
|
||||
Determines how sandbox environments are shared. Use variables to
|
||||
control isolation granularity.
|
||||
zh_Hans: >-
|
||||
决定沙箱环境的共享方式。使用变量控制隔离粒度。
|
||||
type: select
|
||||
required: false
|
||||
default: "{launcher_type}_{launcher_id}"
|
||||
options:
|
||||
- value: "{launcher_type}_{launcher_id}"
|
||||
label:
|
||||
en_US: Per chat (Recommended)
|
||||
zh_Hans: 每个会话(推荐)
|
||||
- value: "{launcher_type}_{launcher_id}_{sender_id}"
|
||||
label:
|
||||
en_US: Per user in chat
|
||||
zh_Hans: 会话中每个用户
|
||||
- value: "{launcher_type}_{launcher_id}_{conversation_id}"
|
||||
label:
|
||||
en_US: Per conversation context
|
||||
zh_Hans: 每个对话上下文
|
||||
- value: "{query_id}"
|
||||
label:
|
||||
en_US: Per message (isolated)
|
||||
zh_Hans: 每条消息(完全隔离)
|
||||
```
|
||||
|
||||
Available template variables (populated by PreProcessor in `query.variables`):
|
||||
|
||||
| Variable | Source | Example |
|
||||
|---------------------|---------------------------------|----------------------|
|
||||
| `{launcher_type}` | `query.session.launcher_type` | `person` / `group` |
|
||||
| `{launcher_id}` | `query.session.launcher_id` | `123456` |
|
||||
| `{sender_id}` | `query.sender_id` | `789` |
|
||||
| `{conversation_id}` | `conversation.uuid` | `a1b2c3d4-...` |
|
||||
| `{query_id}` | `query.query_id` | `42` |
|
||||
|
||||
Default `{launcher_type}_{launcher_id}` covers scenarios 1–4 out of the box.
|
||||
|
||||
---
|
||||
|
||||
## 5. SDK Changes: Multi-Mount BoxSpec
|
||||
|
||||
### 5.1 Model Extension
|
||||
|
||||
```python
|
||||
# box/models.py
|
||||
|
||||
class BoxMountSpec(pydantic.BaseModel):
|
||||
"""A single bind mount specification."""
|
||||
host_path: str
|
||||
mount_path: str
|
||||
mode: BoxHostMountMode = BoxHostMountMode.READ_WRITE
|
||||
|
||||
class BoxSpec(pydantic.BaseModel):
|
||||
# ... existing fields ...
|
||||
host_path: str | None = None # Primary mount (backward compat)
|
||||
host_path_mode: BoxHostMountMode = BoxHostMountMode.READ_WRITE
|
||||
mount_path: str = DEFAULT_BOX_MOUNT_PATH
|
||||
extra_mounts: list[BoxMountSpec] = [] # NEW: additional mounts
|
||||
```
|
||||
|
||||
`extra_mounts` is additive — the existing `host_path` / `mount_path` pair remains
|
||||
the primary mount for backward compatibility.
|
||||
|
||||
### 5.2 Backend: Apply Extra Mounts
|
||||
|
||||
```python
|
||||
# box/backend.py — CLISandboxBackend.start_session()
|
||||
|
||||
# Primary mount (unchanged)
|
||||
if spec.host_path is not None and spec.host_path_mode != BoxHostMountMode.NONE:
|
||||
args.extend(['-v', f'{spec.host_path}:{spec.mount_path}:{spec.host_path_mode.value}'])
|
||||
|
||||
# Extra mounts (NEW)
|
||||
for mount in spec.extra_mounts:
|
||||
if mount.mode != BoxHostMountMode.NONE:
|
||||
args.extend(['-v', f'{mount.host_path}:{mount.mount_path}:{mount.mode.value}'])
|
||||
```
|
||||
|
||||
Same pattern for nsjail backend.
|
||||
|
||||
---
|
||||
|
||||
## 6. LangBot Changes
|
||||
|
||||
### 6.1 Session ID Resolution
|
||||
|
||||
In `BoxService.execute_tool()`:
|
||||
|
||||
```python
|
||||
# Before:
|
||||
spec_payload.setdefault('session_id', str(query.query_id))
|
||||
|
||||
# After:
|
||||
template = (query.pipeline_config or {}).get('ai', {}) \
|
||||
.get('local-agent', {}).get('box-session-id-template',
|
||||
'{launcher_type}_{launcher_id}')
|
||||
variables = query.variables or {}
|
||||
session_id = template.format_map(collections.defaultdict(
|
||||
lambda: 'unknown', variables
|
||||
))
|
||||
spec_payload.setdefault('session_id', session_id)
|
||||
```
|
||||
|
||||
### 6.2 Skill Exec: Use Same Container
|
||||
|
||||
Currently `native.py:_invoke_exec` creates a separate `BoxWorkspaceSession` per
|
||||
skill with `host_path=package_root`. Instead:
|
||||
|
||||
1. Use the **same session_id** as default exec (from the template).
|
||||
2. Pass the skill's `package_root` as an **extra mount** at
|
||||
`/workspace/.skills/{skill_name}/` instead of replacing `/workspace`.
|
||||
3. The container already has the default workspace at `/workspace`.
|
||||
|
||||
```python
|
||||
# native.py — _invoke_exec, skill branch (REVISED)
|
||||
|
||||
# Same session_id as default exec
|
||||
session_id = resolve_box_session_id(query)
|
||||
|
||||
spec_payload = {
|
||||
'cmd': rewritten_command,
|
||||
'workdir': rewritten_workdir,
|
||||
'session_id': session_id,
|
||||
'extra_mounts': [{
|
||||
'host_path': package_root,
|
||||
'mount_path': f'/workspace/.skills/{selected_skill_name}',
|
||||
'mode': 'rw',
|
||||
}],
|
||||
}
|
||||
result = await self.ap.box_service.execute_spec_payload(spec_payload, query)
|
||||
```
|
||||
|
||||
The virtual path `/workspace/.skills/{name}` no longer needs rewriting at the
|
||||
command level — it maps directly to the bind mount path inside the container.
|
||||
|
||||
### 6.3 MCP: Use Same Container
|
||||
|
||||
MCP servers should run inside the same container as exec and skills. Changes:
|
||||
|
||||
1. `BoxStdioSessionRuntime` uses the pipeline's session_id template instead of
|
||||
`mcp-{server_uuid}`.
|
||||
2. MCP server's working directory is a subdirectory (e.g. `/workspace/.mcp/{name}/`).
|
||||
3. MCP server's dependencies are mounted or installed into that subdirectory.
|
||||
4. The MCP server runs as a managed process inside the shared container.
|
||||
|
||||
Since MCP servers start at LangBot boot (not per-query), the session must be
|
||||
created eagerly. The container will be kept alive by the managed process
|
||||
exemption in TTL reaping (`runtime.py:259`).
|
||||
|
||||
**Note**: MCP sessions are pipeline-scoped (not per-launcher), so their session_id
|
||||
should be a **fixed identifier per pipeline** rather than the user-facing template.
|
||||
This means one shared MCP container per pipeline, with user exec sessions separate.
|
||||
|
||||
Alternatively, in a future iteration, MCP managed processes could be launched
|
||||
lazily into the user's container on first MCP tool call. This is more complex
|
||||
but maximizes sharing. For V1, keeping MCP containers at pipeline scope is
|
||||
simpler and more predictable.
|
||||
|
||||
---
|
||||
|
||||
## 7. Mount Layout Summary
|
||||
|
||||
### Default exec (no skills activated)
|
||||
|
||||
```
|
||||
Container (session_id from template)
|
||||
/workspace/ ← default_host_workspace (rw)
|
||||
```
|
||||
|
||||
### Exec with activated skills
|
||||
|
||||
```
|
||||
Container (same session_id)
|
||||
/workspace/ ← default_host_workspace (rw)
|
||||
/workspace/.skills/web-search/ ← skill package_root (rw)
|
||||
/workspace/.skills/data-analysis/ ← skill package_root (rw)
|
||||
```
|
||||
|
||||
Extra mounts are **additive** — they are added when the container is first
|
||||
created (or on the first exec that references a skill). Since Docker bind
|
||||
mounts are specified at container creation time, skills must be known at
|
||||
creation time.
|
||||
|
||||
**Resolution**: When creating a container, inject `extra_mounts` for **all
|
||||
pipeline-bound skills** (from `extensions_preferences`), not just the
|
||||
currently activated one. This way any skill can be activated later without
|
||||
recreating the container.
|
||||
|
||||
### MCP servers (V1: pipeline-scoped)
|
||||
|
||||
```
|
||||
Container (session_id = "mcp-pipeline-{pipeline_uuid}")
|
||||
/workspace/ ← MCP shared workspace
|
||||
/workspace/.mcp/server-a/ ← MCP server A files
|
||||
/workspace/.mcp/server-b/ ← MCP server B files
|
||||
[managed process: server-a]
|
||||
[managed process: server-b]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Data Migration
|
||||
|
||||
Existing pipelines do not have `box-session-id-template`. The backend uses
|
||||
`.get(..., default)` so missing keys fall back to `{launcher_type}_{launcher_id}`.
|
||||
This changes behavior from per-message to per-launcher for existing pipelines.
|
||||
|
||||
Recommendation: **accept the behavior change** — per-launcher is the more
|
||||
intuitive default, and the old per-message behavior was rarely desired.
|
||||
|
||||
---
|
||||
|
||||
## 9. Cloud Quota Implications
|
||||
|
||||
| Scope | Typical concurrent containers |
|
||||
|-----------------------------------------------|-------------------------------|
|
||||
| `{query_id}` (per message) | Many, short-lived |
|
||||
| `{launcher_type}_{launcher_id}` (per chat) | = active chat count |
|
||||
| `{sender_id}` (per user) | = active user count |
|
||||
| `{conversation_id}` (per conversation) | Between per-chat and per-msg |
|
||||
|
||||
With the unified container model, each scope value maps to exactly **one**
|
||||
container (instead of potentially 3+ per-message). This significantly reduces
|
||||
resource usage.
|
||||
|
||||
Quota enforcement point: `BoxRuntime._get_or_create_session()` in the SDK.
|
||||
|
||||
---
|
||||
|
||||
## 10. Implementation Phases
|
||||
|
||||
### Phase 1: Session scope + skill unification (this PR)
|
||||
|
||||
1. **SDK**: Extend `BoxSpec` with `extra_mounts: list[BoxMountSpec]`.
|
||||
2. **SDK**: Update Docker/nsjail backends to apply extra mounts.
|
||||
3. **LangBot**: Add `box-session-id-template` to `local-agent` YAML metadata
|
||||
and default pipeline config JSON.
|
||||
4. **LangBot**: Update `BoxService.execute_tool()` to use template interpolation.
|
||||
5. **LangBot**: Update `native.py:_invoke_exec` skill branch to use same
|
||||
session_id + extra mounts instead of separate `BoxWorkspaceSession`.
|
||||
6. **LangBot**: On container creation, inject extra mounts for all
|
||||
pipeline-bound skills.
|
||||
7. **Frontend**: No code change — `DynamicFormComponent` renders `select` fields.
|
||||
8. **Tests**: Unit tests for template interpolation and multi-mount specs.
|
||||
|
||||
### Phase 2: MCP unification (future)
|
||||
|
||||
1. Refactor `BoxStdioSessionRuntime` to use pipeline-scoped shared container.
|
||||
2. MCP servers become managed processes in the shared container.
|
||||
3. Support multiple concurrent managed processes per container.
|
||||
|
||||
MCP unification is deferred because it requires changes to the managed process
|
||||
model (currently 1 managed process per session) and has startup ordering
|
||||
concerns (MCP servers start at boot, before any user query determines
|
||||
a session_id).
|
||||
122
docs/review/box-test-coverage.md
Normal file
122
docs/review/box-test-coverage.md
Normal file
@@ -0,0 +1,122 @@
|
||||
# Box 系统测试覆盖分析
|
||||
|
||||
> 更新日期: 2026-06-02
|
||||
> 状态更新: 自部署社区版已具备发布条件(box 可选、降级完善、无迁移欠债);工具调用循环上限、配额遍历异步化、`host_path` 挂载白名单等已落地。剩余多租户 / 安全硬化项见 [SaaS 阻塞项清单](./box-issues.md)。
|
||||
> 分支: `feat/sandbox` (LangBot + langbot-plugin-sdk)
|
||||
|
||||
---
|
||||
|
||||
## 1. 测试文件清单
|
||||
|
||||
### LangBot 仓库
|
||||
|
||||
| 文件 | 行数 | CI 运行 | 覆盖范围 |
|
||||
|------|------|---------|---------|
|
||||
| `tests/unit_tests/box/test_box_connector.py` | 106 | 是 | Connector 传输决策、WS relay URL、dispose、心跳/重连 |
|
||||
| `tests/unit_tests/box/test_box_service.py` | 1224 | 是 | Service 核心逻辑(最全面) |
|
||||
| `tests/unit_tests/box/test_workspace.py` | 147 | 是 | WorkspaceSession 路径重写、payload 构建 |
|
||||
| `tests/unit_tests/provider/test_mcp_box_integration.py` | 707 | 是 | MCP Box 配置、路径重写、payload、shared-session/multi-process、runtime info |
|
||||
| `tests/unit_tests/provider/test_localagent_sandbox_exec.py` | 444 | 是 | LocalAgent exec 流程、流式、Skill 激活 (Tool Call) |
|
||||
| `tests/unit_tests/provider/test_tool_manager_native.py` | 249 | 是 | ToolManager 路由、native tool CRUD、路径穿越、6 工具暴露 |
|
||||
| `tests/unit_tests/provider/test_skill_tools.py` | 582 | 是 | Skill 管理、Tool Call 激活、路径、authoring CRUD |
|
||||
| `tests/unit_tests/test_skill_service.py` | 396 | 是 | HTTP service:skill CRUD、zip/GitHub install、文件浏览 |
|
||||
| `tests/unit_tests/test_paths.py` | 23 | 是 | paths 工具 |
|
||||
| `tests/unit_tests/test_preproc.py` | 134 | 是 | PreProcessor 注入 session 变量、bound skill 解析 |
|
||||
| `tests/unit_tests/pipeline/test_chat_handler_logging.py` | 78 | 是 | Chat handler 日志相关回归 |
|
||||
| `tests/integration_tests/box/test_box_integration.py` | 329 | **否** | 真实容器执行、超时、网络隔离 |
|
||||
| `tests/integration_tests/box/test_box_mcp_integration.py` | 368 | **否** | Managed process、WS attach、shared-session 清理 |
|
||||
|
||||
### SDK 仓库
|
||||
|
||||
| 文件 | 行数 | CI 运行 | 覆盖范围 |
|
||||
|------|------|---------|---------|
|
||||
| `tests/box/test_backend_selection.py` | 255 | 是 | 显式 backend / local 模式探测顺序 / 配置变更触发 reselect |
|
||||
| `tests/box/test_nsjail_backend.py` | 452 | 是 | nsjail 可用性、安装版 CLI vs 容器内 CLI、session、arg 构建、资源限制 |
|
||||
| `tests/box/test_e2b_backend.py` | 482 | 是 | E2B SDK mock、session 生命周期、extra_mounts 同步 |
|
||||
| `tests/box/test_skill_store.py` | 88 | 是 | zip preview/install、基础 file CRUD |
|
||||
|
||||
**总计**: 17 个测试文件, ~6,500 行测试代码; 其中 2 个集成测试(约 700 行)在 CI 中不运行。
|
||||
|
||||
> 较 2026-04-16 版增加:`test_skill_service.py`、`test_paths.py`、`test_preproc.py`、`test_chat_handler_logging.py` (LangBot),`test_backend_selection.py`、`test_e2b_backend.py`、`test_skill_store.py` (SDK)。`test_nsjail_backend.py` 增加 CLI 兼容性 case (commit `feed530`)。
|
||||
|
||||
---
|
||||
|
||||
## 2. 覆盖良好的区域
|
||||
|
||||
| 区域 | 质量 | 说明 |
|
||||
|------|------|------|
|
||||
| BoxRuntime session 管理 | 优秀 | session 复用、冲突检测、TTL 配置、消失 session 重建 |
|
||||
| BoxService Profile 系统 | 优秀 | 4 个内置 Profile、locked/unlocked 字段、timeout clamp |
|
||||
| BoxService host mount 安全 | 优秀 | allowed_mount_roots、disallowed_roots、shared host root |
|
||||
| BoxService workspace quota | 优秀 | 前置/后置配额检查、超额清理 |
|
||||
| BoxService 输出截断 | 优秀 | 短/精确边界/长输出、独立 stderr |
|
||||
| BoxService 可观测性 | 优秀 | 状态报告、error ring buffer、buffer 上限 |
|
||||
| BoxService session 模板 | 良好 | `resolve_box_session_id` + `build_skill_extra_mounts` 在 service / native / mcp 三处都有覆盖 |
|
||||
| RPC client/server 协议 | 优秀 | execute/get_sessions/delete/create/conflict error |
|
||||
| BoxRuntimeConnector | 良好 | local/remote 模式、Docker 平台、relay URL、心跳与重连回调 |
|
||||
| BoxWorkspaceSession | 良好 | payload 构建、managed process 路径重写、stage host file |
|
||||
| BoxHostMountMode.NONE | 良好 | 枚举校验、workdir 约束 |
|
||||
| NsjailBackend | 良好 | 可用性、安装版 vs 容器内、session 生命周期、arg 构建、资源限制 |
|
||||
| E2BBackend | 良好 | mock SDK、session/extra_mounts 同步 |
|
||||
| Backend selection | 良好 | 显式 backend 优先级、local 探测顺序、配置变更触发 reselect |
|
||||
| MCP Box 集成 | 良好 | config model、路径重写、payload、shared-session 多 process |
|
||||
| Native tool loader | 良好 | 6 工具(exec/read/write/edit/glob/grep)、路径穿越拦截 |
|
||||
| LocalAgent exec 流程 | 良好 | 完整 tool call 循环、流式、system prompt 注入、Tool Call 激活 |
|
||||
| Skill 系统 | 良好 | 加载、Tool Call 激活、marker、路径解析、authoring CRUD、HTTP service |
|
||||
|
||||
---
|
||||
|
||||
## 3. 覆盖缺失的区域
|
||||
|
||||
### 3.1 零测试 / 严重不足
|
||||
|
||||
| 区域 | 源文件 | 影响 |
|
||||
|------|--------|------|
|
||||
| **`security.py`** | SDK `box/security.py` (52 行) | `validate_sandbox_security()` 无任何测试。阻止 `/etc`/`/proc`/Docker socket 等危险挂载的安全函数从未被验证 |
|
||||
| **`policy.py`** | `pkg/box/policy.py` (98 行) | 三层安全策略无测试(也是死代码) |
|
||||
| **`skill_store.py` 边缘场景** | SDK `box/skill_store.py` (647 行) vs 测试 88 行 | GitHub 安装路径、`source_subdir` / `target_suffix` 组合、损坏 zip、文件冲突等场景未覆盖 |
|
||||
|
||||
### 3.2 未测试的关键路径
|
||||
|
||||
| 区域 | 说明 |
|
||||
|------|------|
|
||||
| **Session TTL 过期** | 测试配置了 `session_ttl_sec` 但从未推进时间验证过期清理 |
|
||||
| **并发 session 访问** | 无并发 exec / 并发创建 / race condition 测试 |
|
||||
| **Container backend (Docker)** | 仅通过集成测试覆盖(CI 不运行),单元测试全用 FakeBackend |
|
||||
| **E2B 真实 sandbox** | 单测全是 mock,未对接真实 E2B API |
|
||||
| **BoxRuntime shutdown()** | 在 test cleanup 中调用但未验证行为 |
|
||||
| **BoxServerHandler 错误路径** | 畸形请求、未知 action 类型 |
|
||||
| **WS relay** | 仅在集成测试中覆盖(CI 不运行) |
|
||||
| **NsjailBackend managed process** | 完全未测试 |
|
||||
| **MCP stdio 完整生命周期** | 依赖安装 → 进程启动 → 健康检查 → 多 process 并发 → 重试 |
|
||||
| **BoxService start/stop_managed_process** | 单 process 流转有单测,多 process 互不阻塞主要靠集成测试 |
|
||||
| **重连指数退避** | connector 单测覆盖回调接线,未实际跑完整重连周期 |
|
||||
|
||||
### 3.3 边缘情况缺失
|
||||
|
||||
| 区域 | 说明 |
|
||||
|------|------|
|
||||
| BoxSpec 校验 | 无效 session_id 格式、超长命令、env 特殊字符 |
|
||||
| BoxSpec.extra_mounts | 重复 mount_path、与 host_path 冲突、绝对 vs 相对路径 |
|
||||
| BoxExecutionResult | 仅 COMPLETED 和 TIMED_OUT,无 ERROR 状态测试 |
|
||||
| 多后端 fallback | local 模式探测顺序仅靠 mock,无真实 Docker 不可用 → nsjail 真机 fallback 测试 |
|
||||
| Profile YAML 加载 | 测试用硬编码字符串,未从真实 config.yaml 加载 |
|
||||
| INIT 配置变更触发 backend 重建 | 单测仅在初始化场景验证 |
|
||||
|
||||
---
|
||||
|
||||
## 4. 集成测试 vs CI 的差距
|
||||
|
||||
CI 仅运行 `tests/unit_tests/`,以下场景**从未在自动化中验证**:
|
||||
|
||||
- 真实容器的创建/执行/销毁
|
||||
- 容器网络隔离(`--network none`)
|
||||
- 容器资源限制生效(cpus/memory/pids_limit)
|
||||
- Managed process 的 WS 双向 I/O
|
||||
- 多 process 同 session 并发 I/O
|
||||
- 孤儿容器清理
|
||||
- Session 删除清理容器
|
||||
- 进程退出检测
|
||||
- E2B 真实 sandbox 行为
|
||||
|
||||
**建议**: 在 CI 中加一个可选的 Docker-in-Docker 集成测试 stage,至少覆盖核心执行路径(exec / MCP attach / session 销毁)。
|
||||
167
docs/review/box-tob-analysis.md
Normal file
167
docs/review/box-tob-analysis.md
Normal file
@@ -0,0 +1,167 @@
|
||||
# Box 系统 toB 商业化分析
|
||||
|
||||
> 更新日期: 2026-06-02
|
||||
> 状态更新: 自部署社区版已具备发布条件(box 可选、降级完善、无迁移欠债);工具调用循环上限、配额遍历异步化、`host_path` 挂载白名单等已落地。剩余多租户 / 安全硬化项见 [SaaS 阻塞项清单](./box-issues.md)。
|
||||
> 分支: `feat/sandbox` (LangBot + langbot-plugin-sdk)
|
||||
|
||||
---
|
||||
|
||||
## 1. 现有优势
|
||||
|
||||
| 能力 | toB 价值 | 代码位置 |
|
||||
|------|---------|---------|
|
||||
| **沙箱隔离执行** | 企业安全运行不受信代码的基础能力 | SDK `box/backend.py` |
|
||||
| **多后端支持** | 适配不同企业容器基础设施 (Podman/Docker/nsjail/E2B) | SDK `box/runtime.py` `_select_backend()` |
|
||||
| **E2B 云沙箱** | SaaS / 无 Docker 部署的兜底执行环境 | SDK `box/e2b_backend.py` |
|
||||
| **连接自愈** | 心跳 + 自动重连,单点 Box runtime 故障可恢复 | `pkg/box/connector.py` `_heartbeat_loop`, `pkg/box/service.py` `_reconnect_loop` |
|
||||
| **Profile + locked 字段** | 运维锁定安全边界,LLM/用户无法绕过 | `pkg/box/service.py`, SDK `box/models.py` |
|
||||
| **资源限制** | CPU/内存/PID 数限制防止资源滥用 | SDK `backend.py` `--cpus/--memory/--pids-limit` |
|
||||
| **Workspace quota** | 磁盘用量控制 | `pkg/box/service.py` `_enforce_workspace_quota` |
|
||||
| **静默降级** | Box 不可用不影响其他功能,降低部署门槛 | `pkg/box/service.py:78` `_available=False` |
|
||||
| **孤儿容器清理** | 防止泄漏的容器持续占用资源 | SDK `backend.py` `cleanup_orphaned_containers` |
|
||||
| **网络隔离** | `--network none` 防止数据外泄 | SDK `backend.py` start_session |
|
||||
| **只读根文件系统** | `--read-only` 防止容器被持久篡改 | SDK `backend.py` start_session |
|
||||
| **Host path 白名单** | `allowed_host_mount_roots` 限制可挂载目录 | `pkg/box/service.py` `_validate_host_mount` |
|
||||
|
||||
---
|
||||
|
||||
## 2. toB 差距分析
|
||||
|
||||
### 2.1 安全与合规
|
||||
|
||||
| 维度 | 现状 | toB 要求 | 优先级 |
|
||||
|------|------|---------|--------|
|
||||
| **WS relay 认证** | 无认证,任何人可 attach | 至少 token 认证 | **P0** |
|
||||
| **安全策略** | policy.py 是死代码,实际无细粒度控制 | 工具级 allow/deny、沙箱模式控制 | **P0** |
|
||||
| **审计日志** | 仅内存中 50 条 `_recent_errors` | 持久化审计:谁何时执行了什么、结果如何 | **P0** |
|
||||
| **Host path 校验** | 黑名单策略,`/` 未拦截 | 白名单策略,默认拒绝 | **P1** |
|
||||
| **数据驻留** | 无控制 | GDPR / 等保要求的数据隔离 | **P2** |
|
||||
|
||||
### 2.2 多租户
|
||||
|
||||
| 维度 | 现状 | toB 要求 | 优先级 |
|
||||
|------|------|---------|--------|
|
||||
| **租户隔离** | 无租户概念 | BoxSpec/Profile 绑定 tenant_id | **P0** |
|
||||
| **RBAC** | 仅 token 认证 | admin/operator/viewer 角色权限 | **P0** |
|
||||
| **资源配额** | 单一 workspace quota | 每租户 CPU 时间/内存/并发/执行次数配额 | **P1** |
|
||||
| **Session 隔离** | 所有 session 共享 dict | 按租户分区,互不可见 | **P1** |
|
||||
|
||||
### 2.3 可靠性
|
||||
|
||||
| 维度 | 现状 | toB 要求 | 优先级 |
|
||||
|------|------|---------|--------|
|
||||
| **连接恢复** | 已实现:20s 心跳 + `_reconnect_loop` 指数退避 | 已满足基本要求 | 已有 |
|
||||
| **Session 清理** | 机会性(仅新建时触发) | 定时清理 + 独立 reaper | **P1** |
|
||||
| **水平扩展** | 单 Box Runtime 实例 | 多实例负载均衡(按 tenant 路由) | **P1** |
|
||||
| **优雅降级** | 已有(_available=False) | 已满足基本要求 | 已有 |
|
||||
| **Backend 自愈** | 已实现:`get_status` 时若 backend 不可用会重新选择 | 已满足基本要求 | 已有 |
|
||||
|
||||
### 2.4 可观测性
|
||||
|
||||
| 维度 | 现状 | toB 要求 | 优先级 |
|
||||
|------|------|---------|--------|
|
||||
| **监控指标** | 无 Prometheus metrics | session 数/执行延迟/资源用量/错误率 | **P1** |
|
||||
| **结构化日志** | Python logging, 无结构化 | JSON 格式日志,含 trace_id/tenant_id | **P1** |
|
||||
| **前端面板** | 监控页接入 `/api/v1/box/status`(backend 名 + 活跃 session 数);`sessions` / `errors` 仍未接入 | 完整状态面板 + 历史错误/审计列表 | **P2** |
|
||||
|
||||
---
|
||||
|
||||
## 3. SaaS 部署架构建议
|
||||
|
||||
### 3.1 方案 A: 共享 Box Runtime Pool (快速上线)
|
||||
|
||||
```
|
||||
LangBot Instance ──> Box Runtime (共享)
|
||||
├─ tenant_id 标签隔离
|
||||
├─ Redis 配额计数器
|
||||
└─ Container labels: langbot.tenant_id=xxx
|
||||
```
|
||||
|
||||
- **优点**: 改动最小,加 tenant_id 到 BoxSpec/labels 即可
|
||||
- **缺点**: 容器引擎共享,安全隔离弱
|
||||
|
||||
### 3.2 方案 B: 每租户 K8s Namespace + gVisor (推荐中期)
|
||||
|
||||
```
|
||||
LangBot ──> K8s API
|
||||
├─ namespace: tenant-xxx
|
||||
│ ├─ RuntimeClass: gVisor (runsc)
|
||||
│ ├─ ResourceQuota
|
||||
│ └─ NetworkPolicy
|
||||
└─ namespace: tenant-yyy
|
||||
└─ ...
|
||||
```
|
||||
|
||||
- **优点**: 强隔离(namespace + gVisor),原生 K8s 配额
|
||||
- **缺点**: 需要重写 backend 为 K8s Job,部署复杂度高
|
||||
|
||||
### 3.3 方案 C: K8s Job 直接编排 (长期)
|
||||
|
||||
```
|
||||
LangBot ──> K8s Job per execution
|
||||
├─ 每次执行创建 Job
|
||||
├─ Pod Security Standards
|
||||
├─ 自动调度和资源分配
|
||||
└─ Job TTL Controller 自动清理
|
||||
```
|
||||
|
||||
- **优点**: 最强隔离,天然水平扩展
|
||||
- **缺点**: 冷启动延迟,架构重写
|
||||
|
||||
**推荐演进路径**: A → B → C
|
||||
|
||||
---
|
||||
|
||||
## 4. 配额体系建议
|
||||
|
||||
### 三层配额
|
||||
|
||||
| 层 | 实现 | 作用 |
|
||||
|----|------|------|
|
||||
| **内核层** | Docker `--cpus`/`--memory`/`--storage-opt` | 硬性资源上限,不可绕过 |
|
||||
| **应用层** | Redis 原子计数器 | 并发 session 数/执行次数/CPU 时间预算 |
|
||||
| **计费层** | 月度聚合 | 按租户计费(session-hours/execution-count) |
|
||||
|
||||
### Profile 与套餐映射
|
||||
|
||||
| 套餐 | Profile | locked 字段 | 配额 |
|
||||
|------|---------|------------|------|
|
||||
| Free | `offline_readonly` | network, host_path_mode, rootfs | 10 exec/天, 0.5 CPU, 256MB |
|
||||
| Pro | `default` | (无) | 100 exec/天, 1 CPU, 512MB |
|
||||
| Enterprise | `network_extended` | (按需) | 无限, 2 CPU, 1GB, 自定义镜像 |
|
||||
|
||||
### TOCTOU 配额修复
|
||||
|
||||
当前 `_enforce_workspace_quota` 的 TOCTOU 问题可通过两种方式解决:
|
||||
|
||||
1. **预留式配额** (应用层): Redis `INCRBY` 预扣额度 → 执行 → 成功则扣减,失败则回滚
|
||||
2. **内核级限制** (Docker): `--storage-opt size=500m` 直接限制容器可写层大小
|
||||
|
||||
---
|
||||
|
||||
## 5. 优先实施路线
|
||||
|
||||
### Phase 1 (2-4 周): 安全基线
|
||||
|
||||
- [ ] WS relay 加 token 认证
|
||||
- [ ] 接入或删除 policy.py
|
||||
- [x] ~~Box 加重连和心跳~~(已完成,见 [box-issues.md 已解决](./box-issues.md))
|
||||
- [ ] 审计日志持久化(至少写文件/数据库)
|
||||
- [ ] `security.py` 加 `/` 拦截,考虑白名单
|
||||
- [ ] INIT 与 backend 初始化顺序整理(避免 backend 在配置到达前实例化)
|
||||
|
||||
### Phase 2 (4-8 周): 多租户基础
|
||||
|
||||
- [ ] BoxSpec 加 `tenant_id` 字段
|
||||
- [ ] 容器 labels 加 tenant 标识
|
||||
- [ ] Redis 配额计数器(并发/执行次数/时间)
|
||||
- [ ] RBAC 基础框架
|
||||
- [ ] 定时 session reaper
|
||||
|
||||
### Phase 3 (8-16 周): 生产就绪
|
||||
|
||||
- [ ] Prometheus metrics exporter
|
||||
- [ ] 前端 Box 状态面板
|
||||
- [ ] K8s backend 支持 (方案 B)
|
||||
- [ ] 结构化日志 (JSON, trace_id)
|
||||
- [ ] 水平扩展支持
|
||||
222
docs/review/box-vs-plugin-runtime.md
Normal file
222
docs/review/box-vs-plugin-runtime.md
Normal file
@@ -0,0 +1,222 @@
|
||||
# Box Runtime vs Plugin Runtime: 连接架构对比
|
||||
|
||||
> 更新日期: 2026-06-02
|
||||
> 状态更新: 自部署社区版已具备发布条件(box 可选、降级完善、无迁移欠债);工具调用循环上限、配额遍历异步化、`host_path` 挂载白名单等已落地。剩余多租户 / 安全硬化项见 [SaaS 阻塞项清单](./box-issues.md)。
|
||||
> 分支: `feat/sandbox` (LangBot + langbot-plugin-sdk)
|
||||
|
||||
---
|
||||
|
||||
## 1. 总体差异
|
||||
|
||||
| 维度 | Plugin Runtime | Box Runtime |
|
||||
|------|---------------|-------------|
|
||||
| **继承关系** | `PluginRuntimeConnector(ManagedRuntimeConnector)` | `BoxRuntimeConnector`(独立类) |
|
||||
| **传输分支** | 3 条 (Docker/WS, Win32/subprocess+WS, Unix/stdio) | 3 条 (本地 stdio, Win32/subprocess+WS, 远程 WS) |
|
||||
| **心跳** | 20s ping loop | 20s ping loop(`_heartbeat_loop`) |
|
||||
| **重连** | WS 模式: sleep 3s → re-initialize | 由 BoxService `_reconnect_loop` 处理,指数退避 |
|
||||
| **Handler 类型** | `RuntimeConnectionHandler` (1132 行, 25+ action) | 基础 `Handler` + `BoxServerHandler`(SDK 端 25 action) |
|
||||
| **Client 抽象** | Handler 即 API | 独立 `ActionRPCBoxClient` 封装 Handler |
|
||||
| **启用/禁用** | `is_enable_plugin` 开关 | 无开关(可用/不可用由初始化结果决定) |
|
||||
| **初始化失败** | 异常上抛 | 静默降级 `_available=False` |
|
||||
| **Shutdown** | 直接杀进程 | RPC SHUTDOWN → 清理容器 → 再杀进程 |
|
||||
|
||||
---
|
||||
|
||||
## 2. 传输决策
|
||||
|
||||
### Plugin: 3-路决策
|
||||
|
||||
```python
|
||||
# pkg/plugin/connector.py:106-165
|
||||
if get_platform() == 'docker' or use_websocket_to_connect_plugin_runtime():
|
||||
# Docker/WS → ws://langbot_plugin_runtime:5400/control/ws
|
||||
elif get_platform() == 'win32':
|
||||
# Windows → 起子进程(无 pipe) + ws://localhost:5400/control/ws
|
||||
else:
|
||||
# Unix/Mac → StdioClientController(python -m langbot_plugin.cli rt -s)
|
||||
```
|
||||
|
||||
### Box: 3-路决策
|
||||
|
||||
```python
|
||||
# pkg/box/connector.py
|
||||
if self._uses_websocket():
|
||||
if platform.get_platform() == 'win32' and not self.configured_runtime_url:
|
||||
await self._start_subprocess_then_ws() # subprocess + ws://localhost:5410/rpc/ws
|
||||
else:
|
||||
await self._connect_remote_ws() # ws://{host}:5410/rpc/ws
|
||||
else:
|
||||
await self._start_local_stdio() # StdioClientController
|
||||
```
|
||||
|
||||
> 历史:2026-04-16 版本本文档曾把 Box 描述为 2 路决策(缺 Windows 分支)。现已对齐 Plugin 的 3 路设计。
|
||||
|
||||
### 决策矩阵
|
||||
|
||||
| 环境 | Plugin | Box |
|
||||
|------|--------|-----|
|
||||
| Docker | WS → `:5400` | WS → `:5410/rpc/ws` |
|
||||
| `--standalone-box` | N/A | WS → `localhost:5410/rpc/ws` |
|
||||
| Windows 非 Docker | subprocess + WS (`:5400`) | subprocess + WS (`localhost:5410/rpc/ws`) |
|
||||
| Unix/Mac 非 Docker | stdio | stdio |
|
||||
| 手动配置 URL | 通过配置项 | WS → 用户配置的 URL |
|
||||
|
||||
---
|
||||
|
||||
## 3. 连接建立
|
||||
|
||||
### 同步模式差异
|
||||
|
||||
**Plugin**: `new_connection_callback` 内直接 ping + await handler_task,`initialize()` 通过 `create_task()` 异步启动,不阻塞等待连接。
|
||||
|
||||
**Box**: 使用 `asyncio.Event` + `wait_for(timeout=30s)` 模式,`initialize()` 同步等待连接成功或超时。
|
||||
|
||||
### Box stdio 路径
|
||||
|
||||
```
|
||||
connector._start_local_stdio()
|
||||
├─ connected = asyncio.Event()
|
||||
├─ ctrl = StdioClientController(python, ['-m', 'langbot_plugin.cli.__init__', 'box', '-s', '--ws-control-port', N])
|
||||
├─ _ctrl_task = create_task(ctrl.run(callback))
|
||||
│ callback:
|
||||
│ handler = Handler(connection) ← 基础 Handler, 无 disconnect_callback
|
||||
│ client.set_handler(handler)
|
||||
│ _handler_task = create_task(handler.run())
|
||||
│ call_action(PING, {}) ← 握手, timeout=15s
|
||||
│ connected.set() ← 通知外层
|
||||
│ await _handler_task ← 阻塞直到断开
|
||||
└─ await wait_for(connected.wait(), 30s) ← 同步等待
|
||||
```
|
||||
|
||||
### Plugin stdio 路径
|
||||
|
||||
```
|
||||
connector.initialize()
|
||||
├─ ctrl = StdioClientController(python, ['-m', 'langbot_plugin.cli', 'rt', '-s'])
|
||||
├─ task = ctrl.run(callback)
|
||||
│ callback:
|
||||
│ disconnect_callback:
|
||||
│ [WS] → runtime_disconnect_callback → 重连
|
||||
│ [stdio] → 仅日志, 不重连
|
||||
│ handler = RuntimeConnectionHandler(conn, disconnect_cb, ap)
|
||||
│ create_task(handler.run())
|
||||
│ handler.ping() ← 握手, timeout=10s
|
||||
│ await handler_task ← 阻塞直到断开
|
||||
├─ create_task(heartbeat_loop()) ← 20s ping loop
|
||||
└─ create_task(task) ← 不等待连接
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. 心跳与重连
|
||||
|
||||
### 心跳
|
||||
|
||||
| 维度 | Plugin | Box |
|
||||
|------|--------|-----|
|
||||
| 有心跳? | 是 | 是(`connector.py` `_heartbeat_loop`) |
|
||||
| 间隔 | 20s | 20s |
|
||||
| 失败处理 | 仅 DEBUG 日志,不触发重连 | 仅 DEBUG 日志,依赖 connection close 触发重连 |
|
||||
| 生命周期 | 整个应用生命周期 | 连接建立后启动;`dispose()` 时 cancel |
|
||||
|
||||
### 重连
|
||||
|
||||
| 维度 | Plugin | Box |
|
||||
|------|--------|-----|
|
||||
| Docker/WS 断开 | `runtime_disconnect_callback` → sleep 3s → re-initialize | `runtime_disconnect_callback` → `BoxService._reconnect_loop()`(指数退避) |
|
||||
| WS 连接失败 | 同上 | 同上;初次失败时 `_available=False`,重连成功后恢复 |
|
||||
| stdio 断开 | 仅日志,不重连 | 接同样回调;stdio 重连需重新 fork 子进程 |
|
||||
| 重连退避 | 固定 3s,无 backoff | 指数退避 |
|
||||
|
||||
> 历史:2026-04-16 版本本文档曾把心跳与重连标记为 Box 缺失。这两项已在 commit `2dfd9d5d` / `c6882cf` / `5029d9c` 等修复(详见 [box-issues.md 已解决](./box-issues.md))。
|
||||
|
||||
---
|
||||
|
||||
## 5. 共享 IO 层
|
||||
|
||||
两者复用同一套 SDK IO 基础设施:
|
||||
|
||||
```
|
||||
Handler ← ABC (runtime/io/handler.py)
|
||||
├── RuntimeConnectionHandler (Plugin 用, LangBot 侧)
|
||||
├── ControlConnectionHandler (Plugin 用, SDK 侧)
|
||||
├── BoxServerHandler (Box 用, SDK 侧)
|
||||
└── 匿名 Handler 实例 (Box 用, LangBot 侧)
|
||||
|
||||
Connection ← ABC
|
||||
├── StdioConnection (stdio: 16KB chunks, 应用层分帧协议)
|
||||
└── WebSocketConnection (WS: 64KB chunks, 原生 WS 分帧)
|
||||
|
||||
Controller ← ABC
|
||||
├── StdioClientController (fork 子进程, pipe stdin/stdout)
|
||||
├── StdioServerController (接管当前进程 stdin/stdout)
|
||||
├── WebSocketClientController (连接 WS 服务端)
|
||||
└── WebSocketServerController (监听 WS 端口)
|
||||
```
|
||||
|
||||
共享的核心机制:
|
||||
- `call_action()` / `call_action_generator()` — RPC 调用/流式调用
|
||||
- `ActionRequest` / `ActionResponse` — 请求/响应协议
|
||||
- `seq_id` 关联 — 并发请求复用单连接
|
||||
- `CommonAction.PING` — 两者都用于初始握手
|
||||
- 文件传输 (`send_file`) — Plugin 用,Box 不用
|
||||
|
||||
---
|
||||
|
||||
## 6. 端口方案
|
||||
|
||||
| 服务 | Plugin | Box |
|
||||
|------|--------|-----|
|
||||
| Action RPC (stdio) | stdin/stdout | stdin/stdout |
|
||||
| Action RPC (WS) | `:5400` | `:5410/rpc/ws` |
|
||||
| 辅助服务 | debug WS `:5401` | managed process WS relay `:5410/v1/sessions/{id}/managed-process/ws` |
|
||||
|
||||
**Box 特点**: 单端口 aiohttp 服务(默认 5410),通过路径区分 Action RPC 和 managed process relay。即使在 stdio 模式,也在 `:5410` 启动 aiohttp 用于 managed process attach。Plugin 在 stdio 模式不开额外端口。
|
||||
|
||||
---
|
||||
|
||||
## 7. 销毁对比
|
||||
|
||||
### Plugin
|
||||
|
||||
```python
|
||||
dispose():
|
||||
if stdio: ctrl.process.terminate()
|
||||
_dispose_subprocess() # Windows 子进程
|
||||
heartbeat_task.cancel()
|
||||
```
|
||||
|
||||
### Box
|
||||
|
||||
```python
|
||||
connector.dispose():
|
||||
_handler_task.cancel()
|
||||
_ctrl_task.cancel()
|
||||
_subprocess.terminate()
|
||||
|
||||
service.dispose():
|
||||
connector.dispose()
|
||||
loop.create_task(client.shutdown()) # RPC SHUTDOWN → 清理所有容器
|
||||
```
|
||||
|
||||
Box 的 RPC SHUTDOWN 确保容器被正确停止,不会成为孤儿。Plugin 直接杀进程。
|
||||
|
||||
---
|
||||
|
||||
## 8. 改进建议
|
||||
|
||||
### P0
|
||||
|
||||
1. **两者都加 WS 认证**: 至少 token 认证(INIT 时下发,连接时校验)
|
||||
|
||||
### P1
|
||||
|
||||
2. **考虑 Box 继承 ManagedRuntimeConnector**: 复用 `_start_runtime_subprocess` / `_wait_until_ready` / `_dispose_subprocess`,减少重复代码
|
||||
3. **Plugin 重连加退避**: 固定 3s 无 backoff 可能造成日志洪水,建议向 Box 的指数退避看齐
|
||||
4. **统一连接管理模式**: Event-based (Box) vs direct-await (Plugin),考虑收敛为一种
|
||||
|
||||
### 已完成(自上一轮)
|
||||
|
||||
- ~~Box 加重连~~(commit `2dfd9d5d`)
|
||||
- ~~Box 加心跳~~(20s loop 与 Plugin 一致)
|
||||
- ~~Box 加 Windows 支持~~(commit `120817a` / `fafb7a4`)
|
||||
Reference in New Issue
Block a user