mirror of
https://github.com/langbot-app/LangBot.git
synced 2026-06-17 19:24:19 +00:00
Feat/rag plugin (#1995)
* [issue:1933] RAG engine plugin architecture (#1967) * refactor: migrate RAG knowledge services to a plugin-oriented host service architecture. * feat(rag): phase 2 core refactor with RPC Action handlers * feat: 为 RAG 插件添加知识库创建和删除事件通知,并优化了 RAG 动作的参数传递和枚举使用。 * feat: 统一知识库管理为RAG引擎,支持动态配置并移除旧的外部知识库组件。 * refactor(rag): remove plugin_adapter, inline logic into RuntimeKnowledgeBase BREAKING CHANGE: RAGPluginAdapter has been removed. All plugin communication is now handled directly by RuntimeKnowledgeBase. Architecture change: - Before: RuntimeKnowledgeBase → RAGPluginAdapter → plugin_connector - After: RuntimeKnowledgeBase → plugin_connector (direct) Changes to kbmgr.py (RuntimeKnowledgeBase): - Remove RAGPluginAdapter import and usage - Inline plugin communication methods: - _on_kb_create(): Notify plugin when KB is created - _on_kb_delete(): Notify plugin when KB is deleted - _ingest_document(): Call plugin for document ingestion - _retrieve(): Call plugin for retrieval - _delete_document(): Call plugin to delete document - Simplify dispose(): Only notify plugin, no built-in VDB assumption Changes to base.py (KnowledgeBaseInterface): - Remove get_type() abstract method (outdated internal/external concept) - Add get_rag_engine_plugin_id() abstract method Changes to localagent.py: - Remove get_type() call - Simplify top_k retrieval from KB entity Deleted files: - pkg/rag/knowledge/plugin_adapter.py Benefits: - Reduced abstraction layer, simpler code - Plugin communication logic centralized in RuntimeKnowledgeBase - Easier to understand and maintain 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor(api): remove ExternalKnowledgeBase infrastructure BREAKING CHANGE: ExternalKnowledgeBase has been completely removed. All knowledge bases are now unified under the single KnowledgeBase model, differentiated by their rag_engine_plugin_id. Deleted files: - pkg/api/http/controller/groups/knowledge/external.py (ExternalKBController with /external-bases routes) - pkg/api/http/service/external_kb.py (ExternalKnowledgeBaseService) - pkg/rag/knowledge/external.py (ExternalKnowledgeBase implementation) Modified files: - pkg/entity/persistence/rag.py: Remove ExternalKnowledgeBase SQLAlchemy table definition - pkg/core/app.py: Remove external_kb_service attribute from LangBotApplication - pkg/core/stages/build_app.py: Remove external_kb_service initialization Migration notes: - Existing external knowledge base data should be migrated manually - API consumers should use /api/v1/knowledge/bases for all KB operations - Use /api/v1/knowledge/engines to discover available RAG engines 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor(plugin): remove list_knowledge_retrievers from connector Remove deprecated list_knowledge_retrievers functionality from the plugin communication layer. This aligns with the SDK change that removed the LIST_KNOWLEDGE_RETRIEVERS action. Changes: - connector.py: Remove list_knowledge_retrievers() method - handler.py: Remove list_knowledge_retrievers() handler The functionality is replaced by the new /api/v1/knowledge/engines endpoint which lists available RAGEngine components with their capabilities and configuration schemas. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor(service): update knowledge service with capability-based checks Replace type-based checks with capability-based checks for file operations, aligning with the unified knowledge base architecture. Changes to knowledge.py: - store_file(): Replace get_type() check with doc_ingestion capability check - delete_file(): Replace get_type() check with doc_ingestion capability check - list_rag_engines(): Remove list_knowledge_retrievers call, simplify to only list RAGEngine components (KnowledgeRetriever type removed) Changes to pipelines.py: - Minor cleanup related to knowledge base references The capability-based approach allows RAG engines to declare their supported features (doc_ingestion, chunking_config, rerank, hybrid_search) and the system responds accordingly, rather than hardcoding behavior based on internal/external type distinction. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * feat(web): unify knowledge base UI, remove external KB components BREAKING CHANGE: The internal/external knowledge base distinction has been removed from the frontend. All knowledge bases are now displayed in a unified list, differentiated by their RAG engine. Changes to page.tsx: - Remove Tab component (内置/外置 tabs) - Remove selectedKbType state - Unified knowledge base list display - Single "Create Knowledge Base" button for all types Changes to KBDetailDialog.tsx: - Remove kbType prop - Simplify dialog logic for unified KB handling - Documents menu item conditionally shown based on doc_ingestion capability Changes to KBForm.tsx: - Remove retriever type handling code - Simplify form for unified KB creation - Dynamic form rendering based on RAG engine's creation_schema Changes to KBCardVO.ts: - Remove 'type' field from KBCardVO interface Changes to BackendClient.ts: - Remove all external KB related methods: - getExternalKnowledgeBases() - getExternalKnowledgeBase() - createExternalKnowledgeBase() - updateExternalKnowledgeBase() - deleteExternalKnowledgeBase() - retrieveFromExternalKnowledgeBase() Changes to api/index.ts: - Remove ExternalKnowledgeBase interface definition UI/UX improvements: - Users no longer need to understand internal vs external distinction - RAG engine selection is now the primary differentiator - Documents panel visibility is capability-driven (doc_ingestion) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor(plugin): code review improvements for RAG handlers - Unify embed_model field naming to embedding_model_uuid only - Add structured error responses with error_type for RAG actions - Fix file_size and mime_type detection in _store_file_task - Improve error handling with detailed error context (error_type, original_error) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor(rag): refactor KB dynamic form and vector manager - Frontend: Refactor Knowledge Base form using DynamicForm components. - Frontend: Remove obsolete jsonSchemaConverter utility. - Backend: Update VectorManager and PluginHandler to support new RAG architecture. - Chore: Update dependencies in pyproject.toml. * fix: code review fixes for RAG refactor - Remove DEBUG stderr outputs in handler.py - Move repeated `import json` to file top - Add warning log for unimplemented delete_by_filter 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor(rag): consolidate valid_fields into entity constants Define MUTABLE_FIELDS, CREATE_FIELDS, ALL_DB_FIELDS as class constants in KnowledgeBase entity to eliminate duplication. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor: 将知识库获取和RAG引擎信息丰富逻辑移至知识库管理器。 * refactor(rag): introduce RAGRuntimeService and clean up plugin handler - Create RAGRuntimeService to encapsulate RAG capability implementation (Embedding, VectorOps). - Refactor PluginHandler to delegate RAG actions to RAGRuntimeService. - Move KnowledgeService enrichment and creation logic to RAGManager. - Register RAGRuntimeService in Application and BuildAppStage. - Clean up legacy code in KnowledgeService. * refactor(rag): standardize logger and fix type hints - Use self.ap.logger consistently in kbmgr.py and runtime.py, removing module-level loggers. - Fix type hints for retrieve_knowledge in handler.py and connector.py to match implementation returning dict. * refactor: 将引擎徽章的样式从 Tailwind CSS 类迁移到 CSS 模块。 * fix(web): resolve React rendering errors in plugins page - Fix missing key prop in PluginComponentList by using ternary instead of Fragment - Fix RAGEngine.name type to I18nObject and use extractI18nObject() for rendering - Preserves multi-language support 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(rag): update runtime service and web components * refactor: 优化知识库设置结构并增强前端距离显示健壮性。 * fix: 处理前端距离显示中的空值。 * fix(rag): document retrieve ui and kbmgr top_k validation * 更新 uv.lock 中的 PyPI 镜像源为官方地址。 * fix: address code review issues for RAG engine plugin architecture P0 fixes: - Fix ALL_DB_FIELDS missing collection_id and emoji fields - Move rag_engine_plugin_id to CREATE_FIELDS (immutable after creation) - Fix creation_settings mutable default value (dict -> None) - Rename vector delete method to delete_by_file_id for correct semantics - Fix delete_by_filter to raise NotImplementedError instead of silent no-op - Add database migration script (dbm019) for new columns and table cleanup P1 fixes: - Clean up design-hesitation comments in connector.py - Add _parse_plugin_id() with format validation for all RAG methods - Make _retrieve() raise exceptions instead of silently returning empty results - Extract _make_rag_error_response() helper for clean error formatting - Remove unused imports from handler.py P2 fixes: - Fix runtime.py indentation inconsistencies - Simplify get_file_stream to use storage abstraction uniformly - Reduce redundant DB queries in knowledge service (extract _check_doc_capability) - Fix engines.py URL encoding: use <path:plugin_id> instead of __ replacement - Add read-only mode for engine settings in KBForm edit mode - Simplify page.tsx handleKBCardClick to pass only kbId string Co-authored-by: Cursor <cursoragent@cursor.com> * fix: address code review findings for RAG plugin architecture - Frontend: add retrieval_settings param to retrieveKnowledgeBase API call - Backend: return {uuid} from PUT knowledge base to match frontend expectation - Backend: validate query is non-empty in retrieve endpoint (400 on empty) - Backend: rename vector_delete ids→file_ids for semantic clarity, keep backward compat by accepting both 'file_ids' and 'ids' in RPC handler - Backend: ensure rag_engine.name fallback is always I18nObject-compatible dict, preventing frontend extractI18nObject from receiving plain strings - Migration: fix misleading docstring about external_kb data migration Co-authored-by: Cursor <cursoragent@cursor.com> * Update langbot-plugin version to 0.2.6 * chore: update required database version from 18 to 19 * refactor: remove unused polymorphic component framework * chore: fix lint and format issues for python and frontend * fix(plugin): remove legacy `ids` fallback in rag_vector_delete handler SDK now sends `file_ids` directly, the `ids` backward-compat fallback is no longer needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(rag): deep review fixes for critical bugs, security and quality Critical: - Fix StorageMgr.load() -> storage_provider.load() (C1, AttributeError) - Update required_database_version 18 -> 19 (C2, migration never runs) Security: - Add path traversal validation in get_file_stream (C11) - Add vectors/ids/metadata length validation in rag_vector_upsert (C12) Logic fixes: - Legacy KBs: set capabilities to [] instead of ['doc_ingestion'] (C4) - Fix store_file return type int -> str (C5) - Fix retrieve_knowledge return [] -> {'results': []} when disabled (C6) - Re-raise exception in _on_kb_create instead of silently swallowing (C7) - Log warning when KB not found in memory during delete (C8) API fixes: - Catch ValueError as 400 in create_knowledge_base endpoint (C15) - Validate plugin_id format in engines endpoints (C16) Quality: - Remove dead if/else in migration with identical branches (C17) - Fix variable shadowing: rag_context -> rag_context_text (C18) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: remove unused os import to fix ruff lint Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor(plugin): remove PolymorphicComponent sync from LangBot side Remove sync_polymorphic_component_instances() from connector and handler, and the post-connection sync call in initialize(). This dead code synced an always-empty list of polymorphic instances that were never created. Companion change to langbot-plugin-sdk PolymorphicComponent removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(rag): fix vector_delete count bug and remove vestigial instance_id parameter 1. vector_delete: assign return value from delete_by_filter to count instead of silently returning 0 for filter-based deletion. 2. Remove instance_id parameter from the entire retrieve_knowledge call chain (kbmgr → connector → handler → runtime). This parameter was a remnant of the PolymorphicComponent mechanism and is no longer used — RAGEngine operates as a stateless singleton. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(web): 支持 creation_schema 字段级别的 editable 属性控制编辑模式可修改性 - IDynamicFormItemSchema 添加 editable 可选属性 - DynamicFormItemConfig 透传 editable 属性 - DynamicFormComponent 接收 isEditing prop,按字段 editable 值控制禁用 - KBForm 解析 editable 并传递 isEditing 给动态表单组件 - editable 未指定时默认可编辑,editable: false 时编辑模式下禁用该字段 * feat(storage): 添加 size() 抽象方法及 LocalStorage/S3 实现 支持获取存储对象大小,S3 使用 head_object 避免下载整个文件 * fix(migration): 删除 external_knowledge_bases 表前记录日志警告 - 迁移时如果表中存在数据,先 warning 日志记录避免无感数据丢失 - 添加 chunk 清理注释说明:仅对旧版非插件架构 KB 有效 * fix(web): 修复检索结果长文本撑大容器导致查询按钮不可见 KBDetailDialog 的 main 容器添加 min-w-0 overflow-x-hidden, 限制 flex-1 子容器宽度,防止 Dify RAG 长文本撑出 Dialog 边界 * fix(rag): address code review issues for plugin architecture PR - Fix SQL injection in migration helpers by using bind parameters - Move numpy import to module level in vector/mgr.py - Improve path traversal validation using posixpath.normpath - Add call_rag_retrieve to connector, eliminating duplicate plugin_id parsing in kbmgr.py _retrieve - Normalize typing style to modern dict/list/None syntax Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style(web): fix prettier formatting errors Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor(rag): update embedding handling in RuntimeConnectionHandler - Renamed RAG_EMBED_DOCUMENTS and RAG_EMBED_QUERY actions to INVOKE_EMBEDDING for clarity. - Removed embed_documents and embed_query methods from RuntimeEmbeddingModel and RAGRuntimeService. - Integrated embedding model retrieval directly in the invoke_embedding method, improving error handling for missing models. - Updated the embedding invocation logic to streamline the process and enhance error reporting. * refactor(web): replace KnowledgeRetriever with RAGEngine across frontend and tests KnowledgeRetriever component type has been removed in favor of the new RAGEngine architecture. Update all remaining references in i18n locales, plugin component icon mappings, marketplace filter, and unit tests. Addresses reviewer notes from RockChinQ on PR #1967. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(rag): address critical bugs found in deep review - Fix path traversal bypass in runtime.py (check all path components for '..') - Use normalized path for file loading instead of raw user input - Change knowledge_bases from list to dict for O(1) lookup and race safety - Add rollback on KB creation failure (clean up DB + runtime on plugin error) - Add null check after KB update in knowledge service - Fix file extension parsing to use os.path.splitext instead of split('.') (handles multi-dot filenames like 'report.v2.pdf' correctly) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(rag): address remaining review issues across frontend and backend Frontend: - Fix KB delete: use async/await with error handling instead of fire-and-forget - Fix capabilities null check: add optional chaining to prevent crash - Add toast.error on KB info load failure instead of silent console.error - Replace hard-coded Chinese validation message with i18n key - Replace hard-coded English error messages in DynamicFormItemComponent with i18n - Optimize document polling: stop when all documents reach terminal state - Add i18n keys (fieldRequired, loadKnowledgeBaseFailed, deleteKnowledgeBaseFailed, getKnowledgeBaseListError) to all 4 locales Backend: - Fix KB delete atomicity: delete from DB first, then notify plugin - Add RAG engine plugin existence validation before creating KB Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style(rag): fix ruff formatting in kbmgr.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Junyan Qin <rockchinq@gmail.com> * chore: bump langbot-plugin to 0.3.0 (#1992) * chore: correct sdk version to 0.3.0a1 * feat: normalize rag related actions' names * refactor(rag): align IngestionContext fields with SDK changes Remove redundant `chunking_strategy` field and rename `custom_settings` to `creation_settings` to match the updated SDK entity definitions (langbot-plugin-sdk#36). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: fix ruff formatting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(rag): enforce immutability of embedding_model_uuid and non-editable creation_settings fields Remove embedding_model_uuid from MUTABLE_FIELDS to prevent post-creation modification via API. Add backend validation for creation_settings to preserve fields marked editable:false in the plugin's creation schema. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style(rag): fix ruff formatting in knowledge service Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor(rag): split settings into immutable creation_settings and mutable retrieval_settings - Remove standalone embedding_model_uuid and top_k columns from KB entity - Add retrieval_settings column; update MUTABLE_FIELDS/CREATE_FIELDS accordingly - Merge migration logic into dbm019 (add retrieval_settings, migrate top_k and embedding_model_uuid into JSON settings, drop old columns on PostgreSQL) - Remove _filter_creation_settings and per-field editable concept - Frontend: creation_settings fields are all disabled when editing, retrieval_settings fields are always editable via a second DynamicFormComponent - Remove editable from IDynamicFormItemSchema, DynamicFormItemConfig - Clean up KBCardVO, KnowledgeBase API type, and localagent runner Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * bugfix: if ingest_document failed,not raise exep * fix: ruff lint * refactor(rag): remove unused _get_kb_entity method from RAGRuntimeService * feat(vector): implement metadata filters for vector_search and vector_delete (#1997) Add functional metadata filter support across all 5 VDB backends using Chroma-style where syntax as the canonical format. Previously the filters parameter existed throughout the stack but was entirely ignored. - Add filter_utils.py with normalize_filter() and strip_unsupported_fields() - Implement filter in search() and add delete_by_filter() for all backends: Chroma/SeekDB (native passthrough), Qdrant (translated to models.Filter), Milvus (translated to expr string), pgvector (translated to SQLAlchemy conditions) - Milvus/pgvector limited to {text, file_id, chunk_uuid}; other fields logged and ignored - Replace delete_by_filter() NotImplementedError with backend delegation in mgr.py - Populate retrieval_context['filters'] from settings in kbmgr._retrieve() - Pass search_type/query_text/documents through handler and runtime service Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style(vector): fix ruff formatting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(vector): remove numpy dependency and fix SeekDB search modes - Remove numpy array conversion for query vectors; all VDB backends accept list[float] directly - Remove redundant get_or_create_collection call from upsert; backends handle collection creation internally in add_embeddings - Fix SeekDB to raise ValueError when vector dimension is unknown instead of defaulting to 384 - Use hybrid_search() for full-text and hybrid search modes in SeekDB, since pyseekdb's query() always requires embeddings Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(vector): escape single quotes in SeekDB documents and metadata Document text containing apostrophes (e.g. "don't", "it's") causes SQL syntax errors in OceanBase because single quotes were not in the escape table. Add single-quote escaping and apply the escape table to the documents parameter in add_embeddings(), not just metadata. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(vector): use standard SQL escaping for single quotes in SeekDB Change single quote escaping from MySQL-style \' to standard SQL '' (doubled quote). The backslash escape is not recognized by OceanBase in NO_BACKSLASH_ESCAPES mode, causing SQL syntax errors when metadata text contains apostrophes (e.g. O'Shea in academic citations). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(rag): persist retrieval_settings on knowledge base creation retrieval_settings was not being passed from the service layer to RAGManager.create_knowledge_base(), causing retrieval schema fields (e.g. query_rewrite) to be lost on initial KB creation. They only took effect after a subsequent edit/update. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(web): add show_if conditional rendering for dynamic forms Support conditional field visibility in plugin-defined forms via show_if rules (eq, neq, in operators). Fields can depend on values from the same form or cross-reference between creation and retrieval settings via externalDependentValues. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(rag): replace base64 with chunked file transfer for get_rag_file_stream Use send_file() instead of base64 encoding for returning file content in the GET_RAG_FILE_STREAM handler, avoiding memory issues with large files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(parser): add parser plugin integration and capability-aware upload UI (#2000) * feat(parser): add parser plugin integration and capability-aware upload UI Backend: add parser plugin API endpoints (list/invoke), connector and handler support for parser actions, and KB manager passthrough. Frontend: thread ragEngineCapabilities prop to FileUploadZone and use doc_parsing capability to conditionally show the RAG engine option in the parser selector. When no parser is available, show a warning prompting users to install a parser plugin. Update i18n: rename builtInParser to "Provided by RAG engine" and add noParserAvailable warning message in all 4 locales. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(parser): replace base64 with chunked file transfer and remove stale cache - Remove @alru_cache from list_parsers() and list_rag_engines() - Replace inline base64 file content with send_file/read_local_file chunked transfer pattern in parse_document and invoke_parser flows - Remove unused base64 import from kbmgr.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * feat(web): add Parser component kind to plugin market UI and i18n Add Parser to kindIconMap, market filter toggle, and all 4 locale files so parser plugins are properly displayed and filterable in the plugin market, matching the existing RAGEngine treatment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style(web): fix prettier formatting from merge Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor: rename RAGEngine to KnowledgeEngine across frontend and backend * fix(web): fix I18nObject import path in FileUploadZone and KBDoc * chore: format files involved in RAGEngine to KnowledgeEngine refactor * refactor: change rag engine to knowledge engine * fix: update langbot-plugin version to 0.3.0rc1 * chore: disable migration 20 for now --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Junyan Qin <rockchinq@gmail.com>
This commit is contained in:
@@ -22,12 +22,12 @@ class KnowledgeBaseInterface(metaclass=abc.ABCMeta):
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
async def retrieve(self, query: str, top_k: int) -> list[rag_context.RetrievalResultEntry]:
|
||||
async def retrieve(self, query: str, settings: dict | None = None) -> list[rag_context.RetrievalResultEntry]:
|
||||
"""Retrieve relevant documents from the knowledge base
|
||||
|
||||
Args:
|
||||
query: The query string
|
||||
top_k: Number of top results to return
|
||||
settings: Optional per-request retrieval settings overrides
|
||||
|
||||
Returns:
|
||||
List of retrieve result entries
|
||||
@@ -45,8 +45,8 @@ class KnowledgeBaseInterface(metaclass=abc.ABCMeta):
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
def get_type(self) -> str:
|
||||
"""Get the type of knowledge base (internal/external)"""
|
||||
def get_knowledge_engine_plugin_id(self) -> str:
|
||||
"""Get the Knowledge Engine plugin ID"""
|
||||
pass
|
||||
|
||||
@abc.abstractmethod
|
||||
|
||||
@@ -1,85 +0,0 @@
|
||||
"""External knowledge base implementation"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from langbot.pkg.core import app
|
||||
from langbot.pkg.entity.persistence import rag as persistence_rag
|
||||
from langbot_plugin.api.entities.builtin.rag import context as rag_context
|
||||
from .base import KnowledgeBaseInterface
|
||||
|
||||
|
||||
class ExternalKnowledgeBase(KnowledgeBaseInterface):
|
||||
"""External knowledge base that queries via HTTP API or plugin retriever"""
|
||||
|
||||
external_kb_entity: persistence_rag.ExternalKnowledgeBase
|
||||
|
||||
# Plugin retriever instance ID
|
||||
retriever_instance_id: str | None
|
||||
|
||||
def __init__(self, ap: app.Application, external_kb_entity: persistence_rag.ExternalKnowledgeBase):
|
||||
super().__init__(ap)
|
||||
self.external_kb_entity = external_kb_entity
|
||||
self.retriever_instance_id = None
|
||||
|
||||
async def initialize(self):
|
||||
"""Initialize the external knowledge base"""
|
||||
# Use KB UUID as instance ID
|
||||
# Instance creation is now handled by the unified sync mechanism
|
||||
# when LangBot connects to runtime
|
||||
self.retriever_instance_id = self.external_kb_entity.uuid
|
||||
|
||||
self.ap.logger.info(
|
||||
f'Initialized external KB {self.external_kb_entity.uuid}, instance will be created by sync mechanism'
|
||||
)
|
||||
|
||||
async def retrieve(self, query: str, top_k: int = 5) -> list[rag_context.RetrievalResultEntry]:
|
||||
"""Retrieve documents from external knowledge base via plugin retriever"""
|
||||
if not self.retriever_instance_id:
|
||||
self.ap.logger.error(f'No retriever instance for KB {self.external_kb_entity.uuid}')
|
||||
return []
|
||||
|
||||
try:
|
||||
results = await self.ap.plugin_connector.retrieve_knowledge(
|
||||
self.external_kb_entity.plugin_author,
|
||||
self.external_kb_entity.plugin_name,
|
||||
self.external_kb_entity.retriever_name,
|
||||
self.retriever_instance_id,
|
||||
{'query': query},
|
||||
)
|
||||
|
||||
# Convert plugin results to RetrievalResultEntry
|
||||
retrieval_entries = []
|
||||
for result in results:
|
||||
retrieval_entries.append(rag_context.RetrievalResultEntry(**result))
|
||||
|
||||
return retrieval_entries
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Plugin retriever error: {e}')
|
||||
import traceback
|
||||
|
||||
traceback.print_exc()
|
||||
return []
|
||||
|
||||
def get_uuid(self) -> str:
|
||||
"""Get the UUID of the external knowledge base"""
|
||||
return self.external_kb_entity.uuid
|
||||
|
||||
def get_name(self) -> str:
|
||||
"""Get the name of the external knowledge base"""
|
||||
return self.external_kb_entity.name
|
||||
|
||||
def get_type(self) -> str:
|
||||
"""Get the type of knowledge base"""
|
||||
return 'external'
|
||||
|
||||
async def dispose(self):
|
||||
"""Clean up resources"""
|
||||
# Trigger sync to immediately delete the instance from plugin process
|
||||
# This ensures instance is cleaned up without waiting for next LangBot restart
|
||||
try:
|
||||
await self.ap.plugin_connector.sync_polymorphic_component_instances()
|
||||
self.ap.logger.info(
|
||||
f'Disposed external KB {self.external_kb_entity.uuid}, triggered sync to delete instance'
|
||||
)
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Failed to sync after disposing KB: {e}')
|
||||
@@ -1,18 +1,19 @@
|
||||
from __future__ import annotations
|
||||
import mimetypes
|
||||
import os.path
|
||||
import traceback
|
||||
import uuid
|
||||
import zipfile
|
||||
import io
|
||||
from .services import parser, chunker
|
||||
from typing import Any
|
||||
from langbot.pkg.core import app
|
||||
from langbot.pkg.rag.knowledge.services.embedder import Embedder
|
||||
from langbot.pkg.rag.knowledge.services.retriever import Retriever
|
||||
import sqlalchemy
|
||||
|
||||
|
||||
from langbot.pkg.entity.persistence import rag as persistence_rag
|
||||
from langbot.pkg.core import taskmgr
|
||||
from langbot_plugin.api.entities.builtin.rag import context as rag_context
|
||||
from .base import KnowledgeBaseInterface
|
||||
from .external import ExternalKnowledgeBase
|
||||
|
||||
|
||||
class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
@@ -20,28 +21,16 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
|
||||
knowledge_base_entity: persistence_rag.KnowledgeBase
|
||||
|
||||
parser: parser.FileParser
|
||||
|
||||
chunker: chunker.Chunker
|
||||
|
||||
embedder: Embedder
|
||||
|
||||
retriever: Retriever
|
||||
|
||||
def __init__(self, ap: app.Application, knowledge_base_entity: persistence_rag.KnowledgeBase):
|
||||
super().__init__(ap)
|
||||
self.knowledge_base_entity = knowledge_base_entity
|
||||
self.parser = parser.FileParser(ap=self.ap)
|
||||
self.chunker = chunker.Chunker(ap=self.ap)
|
||||
self.embedder = Embedder(ap=self.ap)
|
||||
self.retriever = Retriever(ap=self.ap)
|
||||
# 传递kb_id给retriever
|
||||
self.retriever.kb_id = knowledge_base_entity.uuid
|
||||
|
||||
async def initialize(self):
|
||||
pass
|
||||
|
||||
async def _store_file_task(self, file: persistence_rag.File, task_context: taskmgr.TaskContext):
|
||||
async def _store_file_task(
|
||||
self, file: persistence_rag.File, task_context: taskmgr.TaskContext, parser_plugin_id: str | None = None
|
||||
):
|
||||
try:
|
||||
# set file status to processing
|
||||
await self.ap.persistence_mgr.execute_async(
|
||||
@@ -50,31 +39,46 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
.values(status='processing')
|
||||
)
|
||||
|
||||
task_context.set_current_action('Parsing file')
|
||||
# parse file
|
||||
text = await self.parser.parse(file.file_name, file.extension)
|
||||
if not text:
|
||||
raise Exception(f'No text extracted from file {file.file_name}')
|
||||
task_context.set_current_action('Processing file')
|
||||
|
||||
task_context.set_current_action('Chunking file')
|
||||
# chunk file
|
||||
chunks_texts = await self.chunker.chunk(text)
|
||||
if not chunks_texts:
|
||||
raise Exception(f'No chunks extracted from file {file.file_name}')
|
||||
# Get file size from storage
|
||||
file_size = await self.ap.storage_mgr.storage_provider.size(file.file_name)
|
||||
|
||||
task_context.set_current_action('Embedding chunks')
|
||||
# Detect MIME type from extension
|
||||
mime_type, _ = mimetypes.guess_type(file.file_name)
|
||||
if mime_type is None:
|
||||
mime_type = 'application/octet-stream'
|
||||
|
||||
embedding_model = await self.ap.model_mgr.get_embedding_model_by_uuid(
|
||||
self.knowledge_base_entity.embedding_model_uuid
|
||||
)
|
||||
# embed chunks
|
||||
await self.embedder.embed_and_store(
|
||||
kb_id=self.knowledge_base_entity.uuid,
|
||||
file_id=file.uuid,
|
||||
chunks=chunks_texts,
|
||||
embedding_model=embedding_model,
|
||||
# If a parser plugin is specified, call it before ingestion
|
||||
parsed_content = None
|
||||
if parser_plugin_id:
|
||||
task_context.set_current_action('Parsing file')
|
||||
file_bytes = await self.ap.storage_mgr.storage_provider.load(file.file_name)
|
||||
parse_context = {
|
||||
'mime_type': mime_type,
|
||||
'filename': file.file_name,
|
||||
'metadata': {},
|
||||
}
|
||||
parsed_content = await self.ap.plugin_connector.call_parser(parser_plugin_id, parse_context, file_bytes)
|
||||
|
||||
# Call plugin to ingest document
|
||||
result = await self._ingest_document(
|
||||
{
|
||||
'document_id': file.uuid,
|
||||
'filename': file.file_name,
|
||||
'extension': file.extension,
|
||||
'file_size': file_size,
|
||||
'mime_type': mime_type,
|
||||
},
|
||||
file.file_name, # storage path
|
||||
parsed_content=parsed_content,
|
||||
)
|
||||
|
||||
# Check plugin result status
|
||||
if result.get('status') == 'failed':
|
||||
error_msg = result.get('error_message', 'Plugin ingestion returned failed status')
|
||||
raise Exception(error_msg)
|
||||
|
||||
# set file status to completed
|
||||
await self.ap.persistence_mgr.execute_async(
|
||||
sqlalchemy.update(persistence_rag.File)
|
||||
@@ -97,16 +101,17 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
# delete file from storage
|
||||
await self.ap.storage_mgr.storage_provider.delete(file.file_name)
|
||||
|
||||
async def store_file(self, file_id: str) -> str:
|
||||
async def store_file(self, file_id: str, parser_plugin_id: str | None = None) -> str:
|
||||
# pre checking
|
||||
if not await self.ap.storage_mgr.storage_provider.exists(file_id):
|
||||
raise Exception(f'File {file_id} not found')
|
||||
|
||||
file_name = file_id
|
||||
extension = file_name.split('.')[-1].lower()
|
||||
_, ext = os.path.splitext(file_name)
|
||||
extension = ext.lstrip('.').lower() if ext else ''
|
||||
|
||||
if extension == 'zip':
|
||||
return await self._store_zip_file(file_id)
|
||||
return await self._store_zip_file(file_id, parser_plugin_id=parser_plugin_id)
|
||||
|
||||
file_uuid = str(uuid.uuid4())
|
||||
kb_id = self.knowledge_base_entity.uuid
|
||||
@@ -126,7 +131,7 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
# run background task asynchronously
|
||||
ctx = taskmgr.TaskContext.new()
|
||||
wrapper = self.ap.task_mgr.create_user_task(
|
||||
self._store_file_task(file_obj, task_context=ctx),
|
||||
self._store_file_task(file_obj, task_context=ctx, parser_plugin_id=parser_plugin_id),
|
||||
kind='knowledge-operation',
|
||||
name=f'knowledge-store-file-{file_id}',
|
||||
label=f'Store file {file_id}',
|
||||
@@ -134,7 +139,7 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
)
|
||||
return wrapper.id
|
||||
|
||||
async def _store_zip_file(self, zip_file_id: str) -> str:
|
||||
async def _store_zip_file(self, zip_file_id: str, parser_plugin_id: str | None = None) -> str:
|
||||
"""Handle ZIP file by extracting each document and storing them separately."""
|
||||
self.ap.logger.info(f'Processing ZIP file: {zip_file_id}')
|
||||
|
||||
@@ -150,7 +155,8 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
if file_info.is_dir() or file_info.filename.startswith('.'):
|
||||
continue
|
||||
|
||||
file_extension = file_info.filename.split('.')[-1].lower()
|
||||
_, file_ext = os.path.splitext(file_info.filename)
|
||||
file_extension = file_ext.lstrip('.').lower()
|
||||
if file_extension not in supported_extensions:
|
||||
self.ap.logger.debug(f'Skipping unsupported file in ZIP: {file_info.filename}')
|
||||
continue
|
||||
@@ -159,18 +165,18 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
file_content = zip_ref.read(file_info.filename)
|
||||
|
||||
base_name = file_info.filename.replace('/', '_').replace('\\', '_')
|
||||
extension = base_name.split('.')[-1]
|
||||
file_name = base_name.split('.')[0]
|
||||
file_stem, file_ext = os.path.splitext(base_name)
|
||||
extension = file_ext.lstrip('.')
|
||||
|
||||
if file_name.startswith('__MACOSX'):
|
||||
if file_stem.startswith('__MACOSX'):
|
||||
continue
|
||||
|
||||
extracted_file_id = file_name + '_' + str(uuid.uuid4())[:8] + '.' + extension
|
||||
extracted_file_id = file_stem + '_' + str(uuid.uuid4())[:8] + '.' + extension
|
||||
# save file to storage
|
||||
|
||||
await self.ap.storage_mgr.storage_provider.save(extracted_file_id, file_content)
|
||||
|
||||
task_id = await self.store_file(extracted_file_id)
|
||||
task_id = await self.store_file(extracted_file_id, parser_plugin_id=parser_plugin_id)
|
||||
stored_file_tasks.append(task_id)
|
||||
|
||||
self.ap.logger.info(
|
||||
@@ -189,21 +195,28 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
|
||||
return stored_file_tasks[0] if stored_file_tasks else ''
|
||||
|
||||
async def retrieve(self, query: str, top_k: int) -> list[rag_context.RetrievalResultEntry]:
|
||||
embedding_model = await self.ap.model_mgr.get_embedding_model_by_uuid(
|
||||
self.knowledge_base_entity.embedding_model_uuid
|
||||
)
|
||||
return await self.retriever.retrieve(self.knowledge_base_entity.uuid, query, embedding_model, top_k)
|
||||
async def retrieve(self, query: str, settings: dict | None = None) -> list[rag_context.RetrievalResultEntry]:
|
||||
# Merge stored retrieval_settings with per-request overrides
|
||||
stored = self.knowledge_base_entity.retrieval_settings or {}
|
||||
merged = {**stored, **(settings or {})}
|
||||
if 'top_k' not in merged:
|
||||
merged['top_k'] = 5 # fallback default
|
||||
|
||||
response = await self._retrieve(query, merged)
|
||||
|
||||
results_data = response.get('results', [])
|
||||
entries = []
|
||||
for r in results_data:
|
||||
if isinstance(r, dict):
|
||||
entries.append(rag_context.RetrievalResultEntry(**r))
|
||||
elif isinstance(r, rag_context.RetrievalResultEntry):
|
||||
entries.append(r)
|
||||
return entries
|
||||
|
||||
async def delete_file(self, file_id: str):
|
||||
# delete vector
|
||||
await self.ap.vector_db_mgr.vector_db.delete_by_file_id(self.knowledge_base_entity.uuid, file_id)
|
||||
|
||||
# delete chunk
|
||||
await self.ap.persistence_mgr.execute_async(
|
||||
sqlalchemy.delete(persistence_rag.Chunk).where(persistence_rag.Chunk.file_id == file_id)
|
||||
)
|
||||
await self._delete_document(file_id)
|
||||
|
||||
# Also cleanup DB record
|
||||
await self.ap.persistence_mgr.execute_async(
|
||||
sqlalchemy.delete(persistence_rag.File).where(persistence_rag.File.uuid == file_id)
|
||||
)
|
||||
@@ -216,32 +229,289 @@ class RuntimeKnowledgeBase(KnowledgeBaseInterface):
|
||||
"""Get the name of the knowledge base"""
|
||||
return self.knowledge_base_entity.name
|
||||
|
||||
def get_type(self) -> str:
|
||||
"""Get the type of knowledge base"""
|
||||
return 'internal'
|
||||
def get_knowledge_engine_plugin_id(self) -> str:
|
||||
"""Get the Knowledge Engine plugin ID"""
|
||||
return self.knowledge_base_entity.knowledge_engine_plugin_id or ''
|
||||
|
||||
async def dispose(self):
|
||||
await self.ap.vector_db_mgr.vector_db.delete_collection(self.knowledge_base_entity.uuid)
|
||||
"""Dispose the knowledge base, notifying the plugin to cleanup."""
|
||||
await self._on_kb_delete()
|
||||
|
||||
# ========== Plugin Communication Methods ==========
|
||||
|
||||
async def _on_kb_create(self) -> None:
|
||||
"""Notify plugin about KB creation."""
|
||||
plugin_id = self.knowledge_base_entity.knowledge_engine_plugin_id
|
||||
if not plugin_id:
|
||||
return
|
||||
|
||||
try:
|
||||
config = self.knowledge_base_entity.creation_settings or {}
|
||||
self.ap.logger.info(
|
||||
f'Calling RAG plugin {plugin_id}: on_knowledge_base_create(kb_id={self.knowledge_base_entity.uuid})'
|
||||
)
|
||||
await self.ap.plugin_connector.rag_on_kb_create(plugin_id, self.knowledge_base_entity.uuid, config)
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Failed to notify plugin {plugin_id} on KB create: {e}')
|
||||
raise
|
||||
|
||||
async def _on_kb_delete(self) -> None:
|
||||
"""Notify plugin about KB deletion."""
|
||||
plugin_id = self.knowledge_base_entity.knowledge_engine_plugin_id
|
||||
if not plugin_id:
|
||||
return
|
||||
|
||||
try:
|
||||
self.ap.logger.info(
|
||||
f'Calling RAG plugin {plugin_id}: on_knowledge_base_delete(kb_id={self.knowledge_base_entity.uuid})'
|
||||
)
|
||||
await self.ap.plugin_connector.rag_on_kb_delete(plugin_id, self.knowledge_base_entity.uuid)
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Failed to notify plugin {plugin_id} on KB delete: {e}')
|
||||
|
||||
async def _ingest_document(
|
||||
self,
|
||||
file_metadata: dict[str, Any],
|
||||
storage_path: str,
|
||||
parsed_content: dict[str, Any] | None = None,
|
||||
) -> dict[str, Any]:
|
||||
"""Call plugin to ingest document."""
|
||||
kb = self.knowledge_base_entity
|
||||
plugin_id = kb.knowledge_engine_plugin_id
|
||||
if not plugin_id:
|
||||
self.ap.logger.error(f'No RAG plugin ID configured for KB {kb.uuid}. Ingestion failed.')
|
||||
raise ValueError('RAG Plugin ID required')
|
||||
|
||||
self.ap.logger.info(f'Calling RAG plugin {plugin_id}: ingest(doc={file_metadata.get("filename")})')
|
||||
|
||||
# Inject knowledge_base_id into file metadata as required by SDK schema
|
||||
file_metadata['knowledge_base_id'] = kb.uuid
|
||||
|
||||
context_data = {
|
||||
'file_object': {
|
||||
'metadata': file_metadata,
|
||||
'storage_path': storage_path,
|
||||
},
|
||||
'knowledge_base_id': kb.uuid,
|
||||
'collection_id': kb.collection_id or kb.uuid,
|
||||
'creation_settings': kb.creation_settings or {},
|
||||
'parsed_content': parsed_content,
|
||||
}
|
||||
|
||||
try:
|
||||
result = await self.ap.plugin_connector.call_rag_ingest(plugin_id, context_data)
|
||||
return result
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Plugin ingestion failed: {e}')
|
||||
raise
|
||||
|
||||
async def _retrieve(
|
||||
self,
|
||||
query: str,
|
||||
settings: dict[str, Any],
|
||||
) -> dict[str, Any]:
|
||||
"""Call plugin to retrieve documents.
|
||||
|
||||
Raises:
|
||||
ValueError: If no RAG plugin is configured for this KB.
|
||||
Exception: If the plugin retrieval call fails.
|
||||
"""
|
||||
kb = self.knowledge_base_entity
|
||||
plugin_id = kb.knowledge_engine_plugin_id
|
||||
if not plugin_id:
|
||||
raise ValueError(f'No RAG plugin ID configured for KB {kb.uuid}. Retrieval failed.')
|
||||
|
||||
retrieval_context = {
|
||||
'query': query,
|
||||
'knowledge_base_id': kb.uuid,
|
||||
'collection_id': kb.collection_id or kb.uuid,
|
||||
'retrieval_settings': settings,
|
||||
'creation_settings': kb.creation_settings or {},
|
||||
'filters': settings.pop('filters', {}),
|
||||
}
|
||||
|
||||
result = await self.ap.plugin_connector.call_rag_retrieve(
|
||||
plugin_id,
|
||||
retrieval_context,
|
||||
)
|
||||
return result
|
||||
|
||||
async def _delete_document(self, document_id: str) -> bool:
|
||||
"""Call plugin to delete document."""
|
||||
kb = self.knowledge_base_entity
|
||||
plugin_id = kb.knowledge_engine_plugin_id
|
||||
if not plugin_id:
|
||||
return False
|
||||
|
||||
self.ap.logger.info(f'Calling RAG plugin {plugin_id}: delete_document(doc_id={document_id})')
|
||||
|
||||
try:
|
||||
return await self.ap.plugin_connector.call_rag_delete_document(plugin_id, document_id, kb.uuid)
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Plugin document deletion failed: {e}')
|
||||
return False
|
||||
|
||||
|
||||
class RAGManager:
|
||||
ap: app.Application
|
||||
|
||||
knowledge_bases: list[KnowledgeBaseInterface]
|
||||
knowledge_bases: dict[str, KnowledgeBaseInterface]
|
||||
|
||||
def __init__(self, ap: app.Application):
|
||||
self.ap = ap
|
||||
self.knowledge_bases = []
|
||||
self.knowledge_bases = {}
|
||||
|
||||
async def initialize(self):
|
||||
await self.load_knowledge_bases_from_db()
|
||||
|
||||
async def get_all_knowledge_base_details(self) -> list[dict]:
|
||||
"""Get all knowledge bases with enriched Knowledge Engine details."""
|
||||
# 1. Get raw KBs from DB
|
||||
result = await self.ap.persistence_mgr.execute_async(sqlalchemy.select(persistence_rag.KnowledgeBase))
|
||||
knowledge_bases = result.all()
|
||||
|
||||
# 2. Get all available Knowledge Engines for enrichment
|
||||
engine_map = {}
|
||||
if self.ap.plugin_connector.is_enable_plugin:
|
||||
try:
|
||||
engines = await self.ap.plugin_connector.list_knowledge_engines()
|
||||
engine_map = {e['plugin_id']: e for e in engines}
|
||||
except Exception as e:
|
||||
self.ap.logger.warning(f'Failed to list Knowledge Engines: {e}')
|
||||
|
||||
# 3. Serialize and enrich
|
||||
kb_list = []
|
||||
for kb in knowledge_bases:
|
||||
kb_dict = self.ap.persistence_mgr.serialize_model(persistence_rag.KnowledgeBase, kb)
|
||||
self._enrich_kb_dict(kb_dict, engine_map)
|
||||
kb_list.append(kb_dict)
|
||||
|
||||
return kb_list
|
||||
|
||||
async def get_knowledge_base_details(self, kb_uuid: str) -> dict | None:
|
||||
"""Get specific knowledge base with enriched Knowledge Engine details."""
|
||||
result = await self.ap.persistence_mgr.execute_async(
|
||||
sqlalchemy.select(persistence_rag.KnowledgeBase).where(persistence_rag.KnowledgeBase.uuid == kb_uuid)
|
||||
)
|
||||
kb = result.first()
|
||||
if not kb:
|
||||
return None
|
||||
|
||||
kb_dict = self.ap.persistence_mgr.serialize_model(persistence_rag.KnowledgeBase, kb)
|
||||
|
||||
# Fetch engines
|
||||
engine_map = {}
|
||||
if self.ap.plugin_connector.is_enable_plugin:
|
||||
try:
|
||||
engines = await self.ap.plugin_connector.list_knowledge_engines()
|
||||
engine_map = {e['plugin_id']: e for e in engines}
|
||||
except Exception as e:
|
||||
self.ap.logger.warning(f'Failed to list Knowledge Engines: {e}')
|
||||
|
||||
self._enrich_kb_dict(kb_dict, engine_map)
|
||||
return kb_dict
|
||||
|
||||
@staticmethod
|
||||
def _to_i18n_name(name) -> dict:
|
||||
"""Ensure name is always an I18nObject-compatible dict.
|
||||
|
||||
If *name* is already a dict (with ``en_US`` / ``zh_Hans`` keys) it is
|
||||
returned as-is. A plain string is wrapped into an I18nObject so the
|
||||
frontend ``extractI18nObject`` helper never receives an unexpected type.
|
||||
"""
|
||||
if isinstance(name, dict):
|
||||
return name
|
||||
return {'en_US': str(name), 'zh_Hans': str(name)}
|
||||
|
||||
def _enrich_kb_dict(self, kb_dict: dict, engine_map: dict) -> None:
|
||||
"""Helper to inject engine info into KB dict."""
|
||||
plugin_id = kb_dict.get('knowledge_engine_plugin_id')
|
||||
|
||||
# Default fallback structure — name must be I18nObject for frontend compatibility
|
||||
fallback_name = self._to_i18n_name(plugin_id or 'Internal (Legacy)')
|
||||
fallback_info = {
|
||||
'plugin_id': plugin_id,
|
||||
'name': fallback_name,
|
||||
'capabilities': [],
|
||||
}
|
||||
|
||||
if not plugin_id:
|
||||
kb_dict['knowledge_engine'] = fallback_info
|
||||
return
|
||||
|
||||
engine_info = engine_map.get(plugin_id)
|
||||
if engine_info:
|
||||
kb_dict['knowledge_engine'] = {
|
||||
'plugin_id': plugin_id,
|
||||
'name': self._to_i18n_name(engine_info.get('name', plugin_id)),
|
||||
'capabilities': engine_info.get('capabilities', []),
|
||||
}
|
||||
else:
|
||||
kb_dict['knowledge_engine'] = fallback_info
|
||||
|
||||
async def create_knowledge_base(
|
||||
self,
|
||||
name: str,
|
||||
knowledge_engine_plugin_id: str,
|
||||
creation_settings: dict,
|
||||
retrieval_settings: dict | None = None,
|
||||
description: str = '',
|
||||
) -> persistence_rag.KnowledgeBase:
|
||||
"""Create a new knowledge base using a RAG plugin."""
|
||||
# Validate that the Knowledge Engine plugin exists
|
||||
if self.ap.plugin_connector.is_enable_plugin:
|
||||
try:
|
||||
engines = await self.ap.plugin_connector.list_knowledge_engines()
|
||||
engine_ids = [e.get('plugin_id') for e in engines]
|
||||
if knowledge_engine_plugin_id not in engine_ids:
|
||||
raise ValueError(f'Knowledge Engine plugin {knowledge_engine_plugin_id} not found')
|
||||
except ValueError:
|
||||
raise
|
||||
except Exception as e:
|
||||
self.ap.logger.warning(f'Failed to validate Knowledge Engine plugin existence: {e}')
|
||||
|
||||
kb_uuid = str(uuid.uuid4())
|
||||
# Use UUID as collection ID by default for isolation
|
||||
collection_id = kb_uuid
|
||||
|
||||
kb_data = {
|
||||
'uuid': kb_uuid,
|
||||
'name': name,
|
||||
'description': description,
|
||||
'knowledge_engine_plugin_id': knowledge_engine_plugin_id,
|
||||
'collection_id': collection_id,
|
||||
'creation_settings': creation_settings,
|
||||
'retrieval_settings': retrieval_settings or {},
|
||||
}
|
||||
|
||||
# Create Entity
|
||||
kb = persistence_rag.KnowledgeBase(**kb_data)
|
||||
|
||||
# Persist
|
||||
await self.ap.persistence_mgr.execute_async(sqlalchemy.insert(persistence_rag.KnowledgeBase).values(kb_data))
|
||||
|
||||
# Load into Runtime
|
||||
runtime_kb = await self.load_knowledge_base(kb)
|
||||
|
||||
# Notify Plugin — rollback DB record and runtime entry on failure
|
||||
try:
|
||||
await runtime_kb._on_kb_create()
|
||||
except Exception:
|
||||
self.knowledge_bases.pop(kb_uuid, None)
|
||||
await self.ap.persistence_mgr.execute_async(
|
||||
sqlalchemy.delete(persistence_rag.KnowledgeBase).where(persistence_rag.KnowledgeBase.uuid == kb_uuid)
|
||||
)
|
||||
raise
|
||||
|
||||
self.ap.logger.info(f'Created new Knowledge Base {name} ({kb_uuid}) using plugin {knowledge_engine_plugin_id}')
|
||||
return kb
|
||||
|
||||
async def load_knowledge_bases_from_db(self):
|
||||
self.ap.logger.info('Loading knowledge bases from db...')
|
||||
|
||||
self.knowledge_bases = []
|
||||
self.knowledge_bases = {}
|
||||
|
||||
# Load internal knowledge bases
|
||||
# Load knowledge bases
|
||||
result = await self.ap.persistence_mgr.execute_async(sqlalchemy.select(persistence_rag.KnowledgeBase))
|
||||
knowledge_bases = result.all()
|
||||
|
||||
@@ -253,86 +523,37 @@ class RAGManager:
|
||||
f'Error loading knowledge base {knowledge_base.uuid}: {e}\n{traceback.format_exc()}'
|
||||
)
|
||||
|
||||
# Load external knowledge bases
|
||||
external_result = await self.ap.persistence_mgr.execute_async(
|
||||
sqlalchemy.select(persistence_rag.ExternalKnowledgeBase)
|
||||
)
|
||||
external_kbs = external_result.all()
|
||||
|
||||
for external_kb in external_kbs:
|
||||
try:
|
||||
# Don't trigger sync during batch loading - will sync once after LangBot connects to runtime
|
||||
await self.load_external_knowledge_base(external_kb, trigger_sync=False)
|
||||
except Exception as e:
|
||||
self.ap.logger.error(
|
||||
f'Error loading external knowledge base {external_kb.uuid}: {e}\n{traceback.format_exc()}'
|
||||
)
|
||||
|
||||
async def load_knowledge_base(
|
||||
self,
|
||||
knowledge_base_entity: persistence_rag.KnowledgeBase | sqlalchemy.Row | dict,
|
||||
) -> RuntimeKnowledgeBase:
|
||||
if isinstance(knowledge_base_entity, sqlalchemy.Row):
|
||||
# Safe access to _mapping for SQLAlchemy 1.4+
|
||||
knowledge_base_entity = persistence_rag.KnowledgeBase(**knowledge_base_entity._mapping)
|
||||
elif isinstance(knowledge_base_entity, dict):
|
||||
knowledge_base_entity = persistence_rag.KnowledgeBase(**knowledge_base_entity)
|
||||
# Filter out non-database fields (like knowledge_engine which is computed)
|
||||
filtered_dict = {
|
||||
k: v for k, v in knowledge_base_entity.items() if k in persistence_rag.KnowledgeBase.ALL_DB_FIELDS
|
||||
}
|
||||
knowledge_base_entity = persistence_rag.KnowledgeBase(**filtered_dict)
|
||||
|
||||
runtime_knowledge_base = RuntimeKnowledgeBase(ap=self.ap, knowledge_base_entity=knowledge_base_entity)
|
||||
|
||||
await runtime_knowledge_base.initialize()
|
||||
|
||||
self.knowledge_bases.append(runtime_knowledge_base)
|
||||
self.knowledge_bases[runtime_knowledge_base.get_uuid()] = runtime_knowledge_base
|
||||
|
||||
return runtime_knowledge_base
|
||||
|
||||
async def load_external_knowledge_base(
|
||||
self,
|
||||
external_kb_entity: persistence_rag.ExternalKnowledgeBase | sqlalchemy.Row | dict,
|
||||
trigger_sync: bool = True,
|
||||
) -> ExternalKnowledgeBase:
|
||||
"""Load external knowledge base into runtime
|
||||
|
||||
Args:
|
||||
external_kb_entity: External KB entity to load
|
||||
trigger_sync: Whether to trigger sync after loading (default True for manual creation, False for batch loading)
|
||||
"""
|
||||
if isinstance(external_kb_entity, sqlalchemy.Row):
|
||||
external_kb_entity = persistence_rag.ExternalKnowledgeBase(**external_kb_entity._mapping)
|
||||
elif isinstance(external_kb_entity, dict):
|
||||
external_kb_entity = persistence_rag.ExternalKnowledgeBase(**external_kb_entity)
|
||||
|
||||
external_kb = ExternalKnowledgeBase(ap=self.ap, external_kb_entity=external_kb_entity)
|
||||
|
||||
await external_kb.initialize()
|
||||
|
||||
self.knowledge_bases.append(external_kb)
|
||||
|
||||
# Trigger sync to create the instance immediately (for manual creation)
|
||||
# Skip sync during batch loading from DB to avoid multiple sync calls
|
||||
if trigger_sync:
|
||||
try:
|
||||
await self.ap.plugin_connector.sync_polymorphic_component_instances()
|
||||
self.ap.logger.info(f'Triggered sync after loading external KB {external_kb_entity.uuid}')
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Failed to sync after loading external KB: {e}')
|
||||
|
||||
return external_kb
|
||||
|
||||
async def get_knowledge_base_by_uuid(self, kb_uuid: str) -> KnowledgeBaseInterface | None:
|
||||
for kb in self.knowledge_bases:
|
||||
if kb.get_uuid() == kb_uuid:
|
||||
return kb
|
||||
return None
|
||||
return self.knowledge_bases.get(kb_uuid)
|
||||
|
||||
async def remove_knowledge_base_from_runtime(self, kb_uuid: str):
|
||||
for kb in self.knowledge_bases:
|
||||
if kb.get_uuid() == kb_uuid:
|
||||
self.knowledge_bases.remove(kb)
|
||||
return
|
||||
self.knowledge_bases.pop(kb_uuid, None)
|
||||
|
||||
async def delete_knowledge_base(self, kb_uuid: str):
|
||||
for kb in self.knowledge_bases:
|
||||
if kb.get_uuid() == kb_uuid:
|
||||
await kb.dispose()
|
||||
self.knowledge_bases.remove(kb)
|
||||
return
|
||||
kb = self.knowledge_bases.pop(kb_uuid, None)
|
||||
if kb is not None:
|
||||
await kb.dispose()
|
||||
else:
|
||||
self.ap.logger.warning(f'Knowledge base {kb_uuid} not found in runtime, skipping plugin notification')
|
||||
|
||||
@@ -1,15 +0,0 @@
|
||||
# 封装异步操作
|
||||
import asyncio
|
||||
|
||||
|
||||
class BaseService:
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
async def _run_sync(self, func, *args, **kwargs):
|
||||
"""
|
||||
在单独的线程中运行同步函数。
|
||||
如果第一个参数是 session,则在 to_thread 中获取新的 session。
|
||||
"""
|
||||
|
||||
return await asyncio.to_thread(func, *args, **kwargs)
|
||||
@@ -1,49 +0,0 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from typing import List
|
||||
from langbot.pkg.rag.knowledge.services import base_service
|
||||
from langbot.pkg.core import app
|
||||
from langchain_text_splitters import RecursiveCharacterTextSplitter
|
||||
|
||||
|
||||
class Chunker(base_service.BaseService):
|
||||
"""
|
||||
A class for splitting long texts into smaller, overlapping chunks.
|
||||
"""
|
||||
|
||||
def __init__(self, ap: app.Application, chunk_size: int = 500, chunk_overlap: int = 50):
|
||||
self.ap = ap
|
||||
self.chunk_size = chunk_size
|
||||
self.chunk_overlap = chunk_overlap
|
||||
if self.chunk_overlap >= self.chunk_size:
|
||||
self.ap.logger.warning(
|
||||
'Chunk overlap is greater than or equal to chunk size. This may lead to empty or malformed chunks.'
|
||||
)
|
||||
|
||||
def _split_text_sync(self, text: str) -> List[str]:
|
||||
"""
|
||||
Synchronously splits a long text into chunks with specified overlap.
|
||||
This is a CPU-bound operation, intended to be run in a separate thread.
|
||||
"""
|
||||
if not text:
|
||||
return []
|
||||
|
||||
text_splitter = RecursiveCharacterTextSplitter(
|
||||
chunk_size=self.chunk_size,
|
||||
chunk_overlap=self.chunk_overlap,
|
||||
length_function=len,
|
||||
is_separator_regex=False,
|
||||
)
|
||||
return text_splitter.split_text(text)
|
||||
|
||||
async def chunk(self, text: str) -> List[str]:
|
||||
"""
|
||||
Asynchronously chunks a given text into smaller pieces.
|
||||
"""
|
||||
self.ap.logger.info(f'Chunking text (length: {len(text)})...')
|
||||
# Run the synchronous splitting logic in a separate thread
|
||||
chunks = await self._run_sync(self._split_text_sync, text)
|
||||
self.ap.logger.info(f'Text chunked into {len(chunks)} pieces.')
|
||||
self.ap.logger.debug(f'Chunks: {json.dumps(chunks, indent=4, ensure_ascii=False)}')
|
||||
return chunks
|
||||
@@ -1,55 +0,0 @@
|
||||
from __future__ import annotations
|
||||
import uuid
|
||||
from typing import List
|
||||
from langbot.pkg.rag.knowledge.services.base_service import BaseService
|
||||
from langbot.pkg.entity.persistence import rag as persistence_rag
|
||||
from langbot.pkg.core import app
|
||||
from langbot.pkg.provider.modelmgr.requester import RuntimeEmbeddingModel
|
||||
import sqlalchemy
|
||||
|
||||
|
||||
class Embedder(BaseService):
|
||||
def __init__(self, ap: app.Application) -> None:
|
||||
super().__init__()
|
||||
self.ap = ap
|
||||
|
||||
async def embed_and_store(
|
||||
self, kb_id: str, file_id: str, chunks: List[str], embedding_model: RuntimeEmbeddingModel
|
||||
) -> list[persistence_rag.Chunk]:
|
||||
# save chunk to db
|
||||
chunk_entities: list[persistence_rag.Chunk] = []
|
||||
chunk_ids: list[str] = []
|
||||
|
||||
for chunk_text in chunks:
|
||||
chunk_uuid = str(uuid.uuid4())
|
||||
chunk_ids.append(chunk_uuid)
|
||||
chunk_entity = persistence_rag.Chunk(uuid=chunk_uuid, file_id=file_id, text=chunk_text)
|
||||
chunk_entities.append(chunk_entity)
|
||||
|
||||
chunk_dicts = [
|
||||
self.ap.persistence_mgr.serialize_model(persistence_rag.Chunk, chunk) for chunk in chunk_entities
|
||||
]
|
||||
|
||||
await self.ap.persistence_mgr.execute_async(sqlalchemy.insert(persistence_rag.Chunk).values(chunk_dicts))
|
||||
|
||||
# get embeddings (batch size limit: 64 for OpenAI)
|
||||
MAX_BATCH_SIZE = 64
|
||||
embeddings_list: list[list[float]] = []
|
||||
|
||||
for i in range(0, len(chunks), MAX_BATCH_SIZE):
|
||||
batch = chunks[i : i + MAX_BATCH_SIZE]
|
||||
batch_embeddings = await embedding_model.provider.invoke_embedding(
|
||||
model=embedding_model,
|
||||
input_text=batch,
|
||||
extra_args={}, # TODO: add extra args
|
||||
knowledge_base_id=kb_id,
|
||||
call_type='embedding',
|
||||
)
|
||||
embeddings_list.extend(batch_embeddings)
|
||||
|
||||
# save embeddings to vdb
|
||||
await self.ap.vector_db_mgr.vector_db.add_embeddings(kb_id, chunk_ids, embeddings_list, chunk_dicts)
|
||||
|
||||
self.ap.logger.info(f'Successfully saved {len(chunk_entities)} embeddings to Knowledge Base.')
|
||||
|
||||
return chunk_entities
|
||||
@@ -1,291 +0,0 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import PyPDF2
|
||||
import io
|
||||
from docx import Document
|
||||
import chardet
|
||||
from typing import Union, Callable, Any
|
||||
import markdown
|
||||
from bs4 import BeautifulSoup
|
||||
import re
|
||||
import asyncio # Import asyncio for async operations
|
||||
from langbot.pkg.core import app
|
||||
|
||||
|
||||
class FileParser:
|
||||
"""
|
||||
A robust file parser class to extract text content from various document formats.
|
||||
It supports TXT, PDF, DOCX, XLSX, CSV, Markdown, HTML, and EPUB files.
|
||||
All core file reading operations are designed to be run synchronously in a thread pool
|
||||
to avoid blocking the asyncio event loop.
|
||||
"""
|
||||
|
||||
def __init__(self, ap: app.Application):
|
||||
self.ap = ap
|
||||
|
||||
async def _run_sync(self, sync_func: Callable, *args: Any, **kwargs: Any) -> Any:
|
||||
"""
|
||||
Runs a synchronous function in a separate thread to prevent blocking the event loop.
|
||||
This is a general utility method for wrapping blocking I/O operations.
|
||||
"""
|
||||
try:
|
||||
return await asyncio.to_thread(sync_func, *args, **kwargs)
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Error running synchronous function {sync_func.__name__}: {e}')
|
||||
raise
|
||||
|
||||
async def parse(self, file_name: str, extension: str) -> Union[str, None]:
|
||||
"""
|
||||
Parses the file based on its extension and returns the extracted text content.
|
||||
This is the main asynchronous entry point for parsing.
|
||||
|
||||
Args:
|
||||
file_name (str): The name of the file to be parsed, get from ap.storage_mgr
|
||||
|
||||
Returns:
|
||||
Union[str, None]: The extracted text content as a single string, or None if parsing fails.
|
||||
"""
|
||||
|
||||
file_extension = extension.lower()
|
||||
parser_method = getattr(self, f'_parse_{file_extension}', None)
|
||||
|
||||
if parser_method is None:
|
||||
self.ap.logger.error(f'Unsupported file format: {file_extension} for file {file_name}')
|
||||
return None
|
||||
|
||||
try:
|
||||
# Pass file_path to the specific parser methods
|
||||
return await parser_method(file_name)
|
||||
except Exception as e:
|
||||
self.ap.logger.error(f'Failed to parse {file_extension} file {file_name}: {e}')
|
||||
return None
|
||||
|
||||
# --- Helper for reading files with encoding detection ---
|
||||
async def _read_file_content(self, file_name: str) -> Union[str, bytes]:
|
||||
"""
|
||||
Reads a file with automatic encoding detection, ensuring the synchronous
|
||||
file read operation runs in a separate thread.
|
||||
"""
|
||||
|
||||
# def _read_sync():
|
||||
# with open(file_path, 'rb') as file:
|
||||
# raw_data = file.read()
|
||||
# detected = chardet.detect(raw_data)
|
||||
# encoding = detected['encoding'] or 'utf-8'
|
||||
|
||||
# if mode == 'r':
|
||||
# return raw_data.decode(encoding, errors='ignore')
|
||||
# return raw_data # For binary mode
|
||||
|
||||
# return await self._run_sync(_read_sync)
|
||||
file_bytes = await self.ap.storage_mgr.storage_provider.load(file_name)
|
||||
|
||||
detected = chardet.detect(file_bytes)
|
||||
encoding = detected['encoding'] or 'utf-8'
|
||||
|
||||
return file_bytes.decode(encoding, errors='ignore')
|
||||
|
||||
# --- Specific Parser Methods ---
|
||||
|
||||
async def _parse_txt(self, file_name: str) -> str:
|
||||
"""Parses a TXT file and returns its content."""
|
||||
self.ap.logger.info(f'Parsing TXT file: {file_name}')
|
||||
return await self._read_file_content(file_name)
|
||||
|
||||
async def _parse_pdf(self, file_name: str) -> str:
|
||||
"""Parses a PDF file and returns its text content."""
|
||||
self.ap.logger.info(f'Parsing PDF file: {file_name}')
|
||||
|
||||
# def _parse_pdf_sync():
|
||||
# text_content = []
|
||||
# with open(file_name, 'rb') as file:
|
||||
# pdf_reader = PyPDF2.PdfReader(file)
|
||||
# for page in pdf_reader.pages:
|
||||
# text = page.extract_text()
|
||||
# if text:
|
||||
# text_content.append(text)
|
||||
# return '\n'.join(text_content)
|
||||
|
||||
# return await self._run_sync(_parse_pdf_sync)
|
||||
|
||||
pdf_bytes = await self.ap.storage_mgr.storage_provider.load(file_name)
|
||||
|
||||
def _parse_pdf_sync():
|
||||
pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_bytes))
|
||||
text_content = []
|
||||
for page in pdf_reader.pages:
|
||||
text = page.extract_text()
|
||||
if text:
|
||||
text_content.append(text)
|
||||
return '\n'.join(text_content)
|
||||
|
||||
return await self._run_sync(_parse_pdf_sync)
|
||||
|
||||
async def _parse_docx(self, file_name: str) -> str:
|
||||
"""Parses a DOCX file and returns its text content."""
|
||||
self.ap.logger.info(f'Parsing DOCX file: {file_name}')
|
||||
|
||||
docx_bytes = await self.ap.storage_mgr.storage_provider.load(file_name)
|
||||
|
||||
def _parse_docx_sync():
|
||||
doc = Document(io.BytesIO(docx_bytes))
|
||||
text_content = [paragraph.text for paragraph in doc.paragraphs if paragraph.text.strip()]
|
||||
return '\n'.join(text_content)
|
||||
|
||||
return await self._run_sync(_parse_docx_sync)
|
||||
|
||||
async def _parse_doc(self, file_name: str) -> str:
|
||||
"""Handles .doc files, explicitly stating lack of direct support."""
|
||||
self.ap.logger.warning(f'Direct .doc parsing is not supported for {file_name}. Please convert to .docx first.')
|
||||
raise NotImplementedError('Direct .doc parsing not supported. Please convert to .docx first.')
|
||||
|
||||
# async def _parse_xlsx(self, file_name: str) -> str:
|
||||
# """Parses an XLSX file, returning text from all sheets."""
|
||||
# self.ap.logger.info(f'Parsing XLSX file: {file_name}')
|
||||
|
||||
# xlsx_bytes = await self.ap.storage_mgr.storage_provider.load(file_name)
|
||||
|
||||
# def _parse_xlsx_sync():
|
||||
# excel_file = pd.ExcelFile(io.BytesIO(xlsx_bytes))
|
||||
# all_sheet_content = []
|
||||
# for sheet_name in excel_file.sheet_names:
|
||||
# df = pd.read_excel(io.BytesIO(xlsx_bytes), sheet_name=sheet_name)
|
||||
# sheet_text = f'--- Sheet: {sheet_name} ---\n{df.to_string(index=False)}\n'
|
||||
# all_sheet_content.append(sheet_text)
|
||||
# return '\n'.join(all_sheet_content)
|
||||
|
||||
# return await self._run_sync(_parse_xlsx_sync)
|
||||
|
||||
# async def _parse_csv(self, file_name: str) -> str:
|
||||
# """Parses a CSV file and returns its content as a string."""
|
||||
# self.ap.logger.info(f'Parsing CSV file: {file_name}')
|
||||
|
||||
# csv_bytes = await self.ap.storage_mgr.storage_provider.load(file_name)
|
||||
|
||||
# def _parse_csv_sync():
|
||||
# # pd.read_csv can often detect encoding, but explicit detection is safer
|
||||
# # raw_data = self._read_file_content(
|
||||
# # file_name, mode='rb'
|
||||
# # ) # Note: this will need to be await outside this sync function
|
||||
# # _ = raw_data
|
||||
# # For simplicity, we'll let pandas handle encoding internally after a raw read.
|
||||
# # A more robust solution might pass encoding directly to pd.read_csv after detection.
|
||||
# detected = chardet.detect(io.BytesIO(csv_bytes))
|
||||
# encoding = detected['encoding'] or 'utf-8'
|
||||
# df = pd.read_csv(io.BytesIO(csv_bytes), encoding=encoding)
|
||||
# return df.to_string(index=False)
|
||||
|
||||
# return await self._run_sync(_parse_csv_sync)
|
||||
|
||||
async def _parse_md(self, file_name: str) -> str:
|
||||
"""Parses a Markdown file, converting it to structured plain text."""
|
||||
self.ap.logger.info(f'Parsing Markdown file: {file_name}')
|
||||
|
||||
md_bytes = await self.ap.storage_mgr.storage_provider.load(file_name)
|
||||
|
||||
def _parse_markdown_sync():
|
||||
md_content = io.BytesIO(md_bytes).read().decode('utf-8', errors='ignore')
|
||||
html_content = markdown.markdown(
|
||||
md_content, extensions=['extra', 'codehilite', 'tables', 'toc', 'fenced_code']
|
||||
)
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
text_parts = []
|
||||
for element in soup.children:
|
||||
if element.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
|
||||
level = int(element.name[1])
|
||||
text_parts.append('#' * level + ' ' + element.get_text().strip())
|
||||
elif element.name == 'p':
|
||||
text = element.get_text().strip()
|
||||
if text:
|
||||
text_parts.append(text)
|
||||
elif element.name in ['ul', 'ol']:
|
||||
for li in element.find_all('li'):
|
||||
text_parts.append(f'* {li.get_text().strip()}')
|
||||
elif element.name == 'pre':
|
||||
code_block = element.get_text().strip()
|
||||
if code_block:
|
||||
text_parts.append(f'```\n{code_block}\n```')
|
||||
elif element.name == 'table':
|
||||
table_str = self._extract_table_to_markdown_sync(element) # Call sync helper
|
||||
if table_str:
|
||||
text_parts.append(table_str)
|
||||
elif element.name:
|
||||
text = element.get_text(separator=' ', strip=True)
|
||||
if text:
|
||||
text_parts.append(text)
|
||||
cleaned_text = re.sub(r'\n\s*\n', '\n\n', '\n'.join(text_parts))
|
||||
return cleaned_text.strip()
|
||||
|
||||
return await self._run_sync(_parse_markdown_sync)
|
||||
|
||||
async def _parse_html(self, file_name: str) -> str:
|
||||
"""Parses an HTML file, extracting structured plain text."""
|
||||
self.ap.logger.info(f'Parsing HTML file: {file_name}')
|
||||
|
||||
html_bytes = await self.ap.storage_mgr.storage_provider.load(file_name)
|
||||
|
||||
def _parse_html_sync():
|
||||
html_content = io.BytesIO(html_bytes).read().decode('utf-8', errors='ignore')
|
||||
soup = BeautifulSoup(html_content, 'html.parser')
|
||||
for script_or_style in soup(['script', 'style']):
|
||||
script_or_style.decompose()
|
||||
text_parts = []
|
||||
for element in soup.body.children if soup.body else soup.children:
|
||||
if element.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
|
||||
level = int(element.name[1])
|
||||
text_parts.append('#' * level + ' ' + element.get_text().strip())
|
||||
elif element.name == 'p':
|
||||
text = element.get_text().strip()
|
||||
if text:
|
||||
text_parts.append(text)
|
||||
elif element.name in ['ul', 'ol']:
|
||||
for li in element.find_all('li'):
|
||||
text = li.get_text().strip()
|
||||
if text:
|
||||
text_parts.append(f'* {text}')
|
||||
elif element.name == 'table':
|
||||
table_str = self._extract_table_to_markdown_sync(element) # Call sync helper
|
||||
if table_str:
|
||||
text_parts.append(table_str)
|
||||
elif element.name:
|
||||
text = element.get_text(separator=' ', strip=True)
|
||||
if text:
|
||||
text_parts.append(text)
|
||||
cleaned_text = re.sub(r'\n\s*\n', '\n\n', '\n'.join(text_parts))
|
||||
return cleaned_text.strip()
|
||||
|
||||
return await self._run_sync(_parse_html_sync)
|
||||
|
||||
def _add_toc_items_sync(self, toc_list: list, text_content: list, level: int):
|
||||
"""Recursively adds TOC items to text_content (synchronous helper)."""
|
||||
indent = ' ' * level
|
||||
for item in toc_list:
|
||||
if isinstance(item, tuple):
|
||||
chapter, subchapters = item
|
||||
text_content.append(f'{indent}- {chapter.title}')
|
||||
self._add_toc_items_sync(subchapters, text_content, level + 1)
|
||||
else:
|
||||
text_content.append(f'{indent}- {item.title}')
|
||||
|
||||
def _extract_table_to_markdown_sync(self, table_element: BeautifulSoup) -> str:
|
||||
"""Helper to convert a BeautifulSoup table element into a Markdown table string (synchronous)."""
|
||||
headers = [th.get_text().strip() for th in table_element.find_all('th')]
|
||||
rows = []
|
||||
for tr in table_element.find_all('tr'):
|
||||
cells = [td.get_text().strip() for td in tr.find_all('td')]
|
||||
if cells:
|
||||
rows.append(cells)
|
||||
|
||||
if not headers and not rows:
|
||||
return ''
|
||||
|
||||
table_lines = []
|
||||
if headers:
|
||||
table_lines.append(' | '.join(headers))
|
||||
table_lines.append(' | '.join(['---'] * len(headers)))
|
||||
|
||||
for row_cells in rows:
|
||||
padded_cells = row_cells + [''] * (len(headers) - len(row_cells)) if headers else row_cells
|
||||
table_lines.append(' | '.join(padded_cells))
|
||||
|
||||
return '\n'.join(table_lines)
|
||||
@@ -1,53 +0,0 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from . import base_service
|
||||
from ....core import app
|
||||
from ....provider.modelmgr.requester import RuntimeEmbeddingModel
|
||||
from langbot_plugin.api.entities.builtin.rag import context as rag_context
|
||||
from langbot_plugin.api.entities.builtin.provider.message import ContentElement
|
||||
|
||||
|
||||
class Retriever(base_service.BaseService):
|
||||
def __init__(self, ap: app.Application):
|
||||
super().__init__()
|
||||
self.ap = ap
|
||||
|
||||
async def retrieve(
|
||||
self, kb_id: str, query: str, embedding_model: RuntimeEmbeddingModel, k: int = 5
|
||||
) -> list[rag_context.RetrievalResultEntry]:
|
||||
self.ap.logger.info(
|
||||
f"Retrieving for query: '{query[:10]}' with k={k} using {embedding_model.model_entity.uuid}"
|
||||
)
|
||||
|
||||
query_embedding: list[float] = await embedding_model.provider.invoke_embedding(
|
||||
model=embedding_model,
|
||||
input_text=[query],
|
||||
extra_args={}, # TODO: add extra args
|
||||
knowledge_base_id=kb_id,
|
||||
query_text=query,
|
||||
call_type='retrieve',
|
||||
)
|
||||
|
||||
vector_results = await self.ap.vector_db_mgr.vector_db.search(kb_id, query_embedding[0], k)
|
||||
|
||||
# 'ids' shape mirrors the Chroma-style response contract for compatibility
|
||||
matched_vector_ids = vector_results.get('ids', [[]])[0]
|
||||
distances = vector_results.get('distances', [[]])[0]
|
||||
vector_metadatas = vector_results.get('metadatas', [[]])[0]
|
||||
|
||||
if not matched_vector_ids:
|
||||
self.ap.logger.info('No relevant chunks found in vector database.')
|
||||
return []
|
||||
|
||||
result: list[rag_context.RetrievalResultEntry] = []
|
||||
|
||||
for i, id in enumerate(matched_vector_ids):
|
||||
entry = rag_context.RetrievalResultEntry(
|
||||
id=id,
|
||||
content=[ContentElement.from_text(vector_metadatas[i].get('text', ''))],
|
||||
metadata=vector_metadatas[i],
|
||||
distance=distances[i],
|
||||
)
|
||||
result.append(entry)
|
||||
|
||||
return result
|
||||
Reference in New Issue
Block a user