Files
LangBot/src/langbot/pkg/api/http/service/mcp.py
Junyan Chin 8e558ad3a1 Feat/saas sandbox adaptation (#2234)
* fix(box): trust Box-reported skill paths when filesystem is not shared

In separated deployments (Docker Compose, k8s sidecar, --standalone-box,
remote runtime.endpoint) the Box runtime owns its own filesystem, so the
skill package_root it reports via list_skills is not resolvable on the
LangBot side. LangBot's reload_skills and build_skill_extra_mounts
validated those paths with os.path.isdir() against its own filesystem,
which silently dropped every skill in such deployments — breaking the
sandbox skill feature for the nsjail/SaaS backend.

Add BoxService.shares_filesystem_with_box, derived from the connector
transport (stdio = shared, WebSocket = separated), with an explicit
override seam for tests/embedders. Gate both isdir() guards on it: keep
local validation in shared-fs stdio mode, trust Box-reported paths
otherwise. The Box runtime only reports skills found on its own
filesystem, so those paths are valid there by construction.

Adds topology-derivation tests (real connector, no mocks) and
skill-retention tests for both shared and separated filesystems.

* build(docker): ship a self-contained nsjail sandbox backend in the image

Compile nsjail 3.6 from source in a dedicated multi-stage build and carry
only the binary plus its runtime libs (libprotobuf32, libnl-route-3-200)
into the final image. This lets the Box runtime isolate sandboxed code via
nsjail user/mount/pid/net namespaces without a host Docker socket — the
prerequisite for running Box on LangBot Cloud (k8s), where mounting
docker.sock would grant node root and is not acceptable for multi-tenant.

The build toolchain (build-essential/bison/flex/protobuf-dev/libnl-dev)
stays in the nsjail-build stage and is not present in the shipped image.

Verified: image builds (583MB), nsjail --help exits 0, libraries resolve,
and the real NsjailBackend executes an isolated command end-to-end on a
v6.1/cgroup2 host matching LangBot Cloud prod (rlimit fallback path, since
container /sys/fs/cgroup is read-only; PID-namespace isolation confirmed).

* feat(box): SaaS guard to force a single global sandbox scope

Add system.limitation.force_box_session_id_template: when non-empty it
overrides every pipeline's box-session-id-template at resolve time, pinning
all queries to one shared sandbox (e.g. {global}). This is the authoritative,
unbypassable guard — it runs on every exec call, so editing the pipeline
config via API cannot escape it. The web UI locks the Sandbox Scope selector
via a combined box_scope_editable flag (box available AND not forced).

* build(deps): pin langbot-plugin==0.4.2b1 (nsjail cgroup container-safety beta)

* fix(web): show forced sandbox scope + make disabled tooltip tap-friendly

When a SaaS deployment pins every pipeline to a fixed sandbox scope via
system.limitation.force_box_session_id_template, the Sandbox Scope selector was
correctly locked but still displayed the pipeline's stored value (e.g. the
per-chat default), misrepresenting the scope that the runtime actually enforces
on every exec. Coerce the displayed/saved value to the forced template so the
locked selector truthfully shows the active scope (e.g. Global).

Also fix the disabled_tooltip being invisible on touch devices: hover-only Radix
tooltips never open without a pointer, so the explanation of why the field is
locked could not be read on mobile. Wrap the info icon so a tap toggles the
tooltip while desktop hover still works.

* feat(web): hide sidebar new-version prompt for edition=cloud

Cloud instances are upgraded centrally by the operator, so surfacing a GitHub
'new version available' badge to tenants is misleading and actionable only by
the operator. Skip the release check entirely when edition=cloud.

* style(web): prettier formatting for DisabledTooltipIcon ternary

* chore(deps): bump langbot-plugin to 0.4.2b2

Picks up the SDK fix that creates a read-write host_path before the
nsjail bind-mount, fixing the SaaS MCP shared-workspace sandbox failure
(exec exit 255 with empty output when host_path didn't exist).

* chore(deps): bump langbot-plugin to 0.4.2b3

Picks up the nsjail /dev-node fix so stdio MCP servers (uvx-launched) can
start under force_global_sandbox instead of failing with 'Connection closed
/ please check URL'.

* fix(web): show real MCP runtime status on installed extensions list

The installed-extensions list badge keyed solely off the enable flag, so a
server that was still CONNECTING (or in ERROR) was shown as 'Connected'.
Reflect the actual runtime_info.status (connecting/connected/error/disabled)
with matching colors, and poll quietly every 3s while any MCP server is
connecting so the badge transitions without a manual refresh.

* chore(deps): bump langbot-plugin to 0.4.2b4

Picks up the 30s start_managed_process timeout so cold uvx MCP bootstraps
don't get torn down mid-install.

* style(web): satisfy prettier — parenthesize nullish-coalescing in ternary

* fix(mcp): isolate transient test sessions from the shared Box session

A config-page 'test' (server_name='_', no persisted UUID) ran in the same
shared 'mcp-shared' Box session as live MCP servers. A failing test (e.g.
empty args) churned that shared session and tore down healthy, already-
connected servers — leaving them stuck after exhausting their retries.

Mark UUID-less sessions as transient, give them their own isolated Box
session ('mcp-test-<uuid>'), and fully delete that session on cleanup so
tests can never disturb live servers and don't leak sessions.

* fix(mcp): tear down transient test session after test completes

A successful config-page test left its isolated 'mcp-test-<uuid>' Box
session running (the lifecycle task blocks until shutdown). Wrap the
transient test coroutine so it always shuts the session down afterward,
preventing isolated test sessions from leaking.
2026-06-09 19:30:17 +08:00

183 lines
8.2 KiB
Python

from __future__ import annotations
import sqlalchemy
import uuid
import asyncio
from ....core import app
from ....entity.persistence import mcp as persistence_mcp
from ....core import taskmgr
from ....provider.tools.loaders.mcp import RuntimeMCPSession, MCPSessionStatus
class MCPService:
ap: app.Application
def __init__(self, ap: app.Application) -> None:
self.ap = ap
async def get_runtime_info(self, server_name: str) -> dict | None:
session = self.ap.tool_mgr.mcp_tool_loader.get_session(server_name)
if session:
return session.get_runtime_info_dict()
return None
async def get_mcp_servers(self, contain_runtime_info: bool = False) -> list[dict]:
result = await self.ap.persistence_mgr.execute_async(sqlalchemy.select(persistence_mcp.MCPServer))
servers = result.all()
serialized_servers = [
self.ap.persistence_mgr.serialize_model(persistence_mcp.MCPServer, server) for server in servers
]
if contain_runtime_info:
for server in serialized_servers:
runtime_info = await self.get_runtime_info(server['name'])
server['runtime_info'] = runtime_info if runtime_info else None
return serialized_servers
async def create_mcp_server(self, server_data: dict) -> str:
# Check limitation (extensions = MCP servers + plugins)
limitation = self.ap.instance_config.data.get('system', {}).get('limitation', {})
max_extensions = limitation.get('max_extensions', -1)
if max_extensions >= 0:
existing_mcp_servers = await self.get_mcp_servers()
plugins = await self.ap.plugin_connector.list_plugins()
total_extensions = len(existing_mcp_servers) + len(plugins)
if total_extensions >= max_extensions:
raise ValueError(f'Maximum number of extensions ({max_extensions}) reached')
server_data['uuid'] = str(uuid.uuid4())
await self.ap.persistence_mgr.execute_async(sqlalchemy.insert(persistence_mcp.MCPServer).values(server_data))
result = await self.ap.persistence_mgr.execute_async(
sqlalchemy.select(persistence_mcp.MCPServer).where(persistence_mcp.MCPServer.uuid == server_data['uuid'])
)
server_entity = result.first()
if server_entity:
server_config = self.ap.persistence_mgr.serialize_model(persistence_mcp.MCPServer, server_entity)
if self.ap.tool_mgr.mcp_tool_loader:
task = asyncio.create_task(self.ap.tool_mgr.mcp_tool_loader.host_mcp_server(server_config))
self.ap.tool_mgr.mcp_tool_loader._hosted_mcp_tasks.append(task)
return server_data['uuid']
async def get_mcp_server_by_name(self, server_name: str) -> dict | None:
result = await self.ap.persistence_mgr.execute_async(
sqlalchemy.select(persistence_mcp.MCPServer).where(persistence_mcp.MCPServer.name == server_name)
)
server = result.first()
if server is None:
return None
runtime_info = await self.get_runtime_info(server.name)
server_data = self.ap.persistence_mgr.serialize_model(persistence_mcp.MCPServer, server)
server_data['runtime_info'] = runtime_info if runtime_info else None
return server_data
async def update_mcp_server(self, server_uuid: str, server_data: dict) -> None:
result = await self.ap.persistence_mgr.execute_async(
sqlalchemy.select(persistence_mcp.MCPServer).where(persistence_mcp.MCPServer.uuid == server_uuid)
)
old_server = result.first()
old_server_name = old_server.name if old_server else None
old_enable = old_server.enable if old_server else False
await self.ap.persistence_mgr.execute_async(
sqlalchemy.update(persistence_mcp.MCPServer)
.where(persistence_mcp.MCPServer.uuid == server_uuid)
.values(server_data)
)
if self.ap.tool_mgr.mcp_tool_loader:
new_enable = server_data.get('enable', False)
need_remove = old_server_name and old_server_name in self.ap.tool_mgr.mcp_tool_loader.sessions
if old_enable and not new_enable:
if need_remove:
await self.ap.tool_mgr.mcp_tool_loader.remove_mcp_server(old_server_name)
elif not old_enable and new_enable:
result = await self.ap.persistence_mgr.execute_async(
sqlalchemy.select(persistence_mcp.MCPServer).where(persistence_mcp.MCPServer.uuid == server_uuid)
)
updated_server = result.first()
if updated_server:
server_config = self.ap.persistence_mgr.serialize_model(persistence_mcp.MCPServer, updated_server)
task = asyncio.create_task(self.ap.tool_mgr.mcp_tool_loader.host_mcp_server(server_config))
self.ap.tool_mgr.mcp_tool_loader._hosted_mcp_tasks.append(task)
elif old_enable and new_enable:
if need_remove:
await self.ap.tool_mgr.mcp_tool_loader.remove_mcp_server(old_server_name)
result = await self.ap.persistence_mgr.execute_async(
sqlalchemy.select(persistence_mcp.MCPServer).where(persistence_mcp.MCPServer.uuid == server_uuid)
)
updated_server = result.first()
if updated_server:
server_config = self.ap.persistence_mgr.serialize_model(persistence_mcp.MCPServer, updated_server)
task = asyncio.create_task(self.ap.tool_mgr.mcp_tool_loader.host_mcp_server(server_config))
self.ap.tool_mgr.mcp_tool_loader._hosted_mcp_tasks.append(task)
async def delete_mcp_server(self, server_uuid: str) -> None:
result = await self.ap.persistence_mgr.execute_async(
sqlalchemy.select(persistence_mcp.MCPServer).where(persistence_mcp.MCPServer.uuid == server_uuid)
)
server = result.first()
server_name = server.name if server else None
await self.ap.persistence_mgr.execute_async(
sqlalchemy.delete(persistence_mcp.MCPServer).where(persistence_mcp.MCPServer.uuid == server_uuid)
)
if server_name and self.ap.tool_mgr.mcp_tool_loader:
if server_name in self.ap.tool_mgr.mcp_tool_loader.sessions:
await self.ap.tool_mgr.mcp_tool_loader.remove_mcp_server(server_name)
async def test_mcp_server(self, server_name: str, server_data: dict) -> int:
"""测试 MCP 服务器连接并返回任务 ID"""
runtime_mcp_session: RuntimeMCPSession | None = None
if server_name != '_':
runtime_mcp_session = self.ap.tool_mgr.mcp_tool_loader.get_session(server_name)
if runtime_mcp_session is None:
raise ValueError(f'Server not found: {server_name}')
if runtime_mcp_session.status == MCPSessionStatus.ERROR:
coroutine = runtime_mcp_session.start()
else:
coroutine = runtime_mcp_session.refresh()
else:
runtime_mcp_session = await self.ap.tool_mgr.mcp_tool_loader.load_mcp_server(server_config=server_data)
# A transient test owns an isolated Box session. Always tear it down
# after the test completes (success or failure) so it does not leak.
test_session = runtime_mcp_session
async def _run_and_cleanup() -> None:
try:
await test_session.start()
finally:
try:
await test_session.shutdown()
except Exception as exc:
self.ap.logger.warning(
f'Failed to tear down transient MCP test session '
f'{test_session.server_name}: {type(exc).__name__}: {exc}'
)
coroutine = _run_and_cleanup()
ctx = taskmgr.TaskContext.new()
wrapper = self.ap.task_mgr.create_user_task(
coroutine,
kind='mcp-operation',
name=f'mcp-test-{server_name}',
label=f'Testing MCP server {server_name}',
context=ctx,
)
return wrapper.id