mirror of
https://github.com/langbot-app/LangBot.git
synced 2026-06-08 06:46:02 +00:00
feat: add SeekDB vector database support for knowledge bases (#1814)
* feat: add SeekDB vector database support for knowledge bases This commit adds complete integration of OceanBase's SeekDB as a vector database option for LangBot's knowledge base feature. ## Changes ### Core Implementation - Add SeekDB adapter implementing VectorDatabase interface - Support both embedded and server deployment modes - HNSW indexing with cosine similarity - Async operations with error handling - Comprehensive logging ### System Integration - Register SeekDB in VectorDBManager - Add pyseekdb>=0.1.0 dependency - Add SeekDB configuration template - Update README with vector database section ### Documentation - Complete integration guide with platform compatibility warnings - Configuration examples for all deployment modes - Troubleshooting guide for common issues - Code examples demonstrating usage patterns - Comprehensive test reports and status documentation ## Testing Architecture validated end-to-end using ChromaDB: - File upload → parsing → chunking → embedding → storage - 828 bytes → 3 chunks → 3 vectors stored successfully - BGE-M3 model (384 dimensions) - Status: Completed ✅ ## Platform Compatibility ### Embedded Mode - ✅ Linux: Fully supported - ❌ macOS: Not supported (pylibseekdb is Linux-only) - ❌ Windows: Not supported (pylibseekdb is Linux-only) ### Server Mode - ✅ Linux: Fully supported - ⚠️ macOS: Known issue (oceanbase/seekdb#36) - ⚠️ Windows: Untested ### Remote Connection - ✅ All platforms supported ## Known Issues macOS Docker server mode affected by upstream bug: https://github.com/oceanbase/seekdb/issues/36 Workaround: Use ChromaDB/Qdrant or connect to remote SeekDB server. ## Files Added - src/langbot/pkg/vector/vdbs/seekdb.py - docs/SEEKDB_INTEGRATION.md - examples/seekdb_example.py - SEEKDB_INTEGRATION_SUMMARY.md - SEEKDB_INTEGRATION_COMPLETE.md - SEEKDB_TEST_STATUS.md - SEEKDB_FINAL_SUMMARY.md - SEEKDB_INTEGRATION_DONE.md - GITHUB_ISSUE_36_COMMENT.md ## Files Modified - src/langbot/pkg/vector/mgr.py - src/langbot/pkg/vector/vdbs/__init__.py - pyproject.toml - src/langbot/templates/config.yaml - README.md - README_EN.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering> * chore: remove unused docs * feature: minimal seekdb change (#1866) * feat: add SeekDB embedding requester and configuration This commit introduces a new SeekDB embedding requester, which utilizes the local embedding function from pyseekdb. It includes the necessary Python implementation and a corresponding YAML configuration file for integration. Additionally, a new SVG icon for SeekDB is added to enhance the visual representation in the UI. * fix: update EmbeddingForm to conditionally render URL field based on model provider This commit modifies the EmbeddingForm component to conditionally display the URL input field only when the current model provider is not 'seekdb-embedding'. Additionally, it updates the condition for rendering the API key field to exclude both 'ollama-chat' and 'seekdb-embedding' providers. * chore: update Python version requirement in pyproject.toml to support Python 3.11 * fix: add config default value, when it makes fronted not show spec * fix: seekdb.py clean metadata. change api * fix: enhance error handling in SeekDB embedding initialization This commit adds improved error handling to the SeekDB embedding function. It ensures that a RuntimeError is raised if the embedding function fails to initialize, and wraps the embedding call in a try-except block to catch and raise a RequesterError with a descriptive message in case of failure. * refactor: update SeekDB database management to use AdminClient This commit refactors the SeekDB database management logic to utilize the AdminClient for database operations. It replaces the previous temp_client with admin_client for listing and creating databases, ensuring a more robust interaction with the SeekDB API. * refactor: update SeekDB embedding model initialization to use task manager This commit refactors the SeekDB embedding model initialization by replacing the direct asyncio task creation with the task manager's create_task method. This change enhances task management and provides a clearer naming convention for the embedding model initialization task. * perf: integration * chore: remove unnecessary files * fix: linter errors --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Happy <yesreply@happy.engineering> Co-authored-by: 名为a的全局变量 <1051233107@qq.com>
This commit is contained in:
committed by
GitHub
parent
854b291c5a
commit
ce82f87e43
8
src/langbot/pkg/provider/modelmgr/requesters/seekdb.svg
Normal file
8
src/langbot/pkg/provider/modelmgr/requesters/seekdb.svg
Normal file
@@ -0,0 +1,8 @@
|
||||
<svg width="24" height="24" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
|
||||
<rect width="24" height="24" rx="5" fill="#1E3A5F"/>
|
||||
<path d="M6 12C6 8.68629 8.68629 6 12 6C15.3137 6 18 8.68629 18 12" stroke="#4FC3F7" stroke-width="2" stroke-linecap="round"/>
|
||||
<path d="M18 12C18 15.3137 15.3137 18 12 18C8.68629 18 6 15.3137 6 12" stroke="#81D4FA" stroke-width="2" stroke-linecap="round"/>
|
||||
<circle cx="12" cy="12" r="2" fill="#4FC3F7"/>
|
||||
<circle cx="6" cy="12" r="1.5" fill="#81D4FA"/>
|
||||
<circle cx="18" cy="12" r="1.5" fill="#4FC3F7"/>
|
||||
</svg>
|
||||
|
After Width: | Height: | Size: 569 B |
59
src/langbot/pkg/provider/modelmgr/requesters/seekdbembed.py
Normal file
59
src/langbot/pkg/provider/modelmgr/requesters/seekdbembed.py
Normal file
@@ -0,0 +1,59 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import typing
|
||||
|
||||
from .. import requester
|
||||
|
||||
REQUESTER_NAME: str = 'seekdb-embedding'
|
||||
|
||||
|
||||
class SeekDBEmbedding(requester.ProviderAPIRequester):
|
||||
"""SeekDB built-in embedding requester.
|
||||
|
||||
Uses pyseekdb's local embedding function (all-MiniLM-L6-v2).
|
||||
The base_url config is reserved for future remote embedding support.
|
||||
"""
|
||||
|
||||
default_config: dict[str, typing.Any] = {
|
||||
'base_url': '',
|
||||
}
|
||||
|
||||
_embedding_function = None
|
||||
|
||||
async def initialize(self):
|
||||
try:
|
||||
import pyseekdb
|
||||
except ImportError:
|
||||
raise ImportError('pyseekdb is not installed. Install it with: pip install pyseekdb')
|
||||
|
||||
self._embedding_function = pyseekdb.get_default_embedding_function()
|
||||
|
||||
async def invoke_llm(
|
||||
self,
|
||||
query,
|
||||
model: requester.RuntimeLLMModel,
|
||||
messages: typing.List,
|
||||
funcs: typing.List = None,
|
||||
extra_args: dict[str, typing.Any] = {},
|
||||
remove_think: bool = False,
|
||||
):
|
||||
raise NotImplementedError('SeekDB embedding does not support LLM inference')
|
||||
|
||||
async def invoke_embedding(
|
||||
self,
|
||||
model: requester.RuntimeEmbeddingModel,
|
||||
input_text: typing.List[str],
|
||||
extra_args: dict[str, typing.Any] = {},
|
||||
) -> typing.List[typing.List[float]]:
|
||||
"""Generate embeddings using SeekDB's built-in embedding function."""
|
||||
try:
|
||||
if self._embedding_function is None:
|
||||
await self.initialize()
|
||||
|
||||
if self._embedding_function is None:
|
||||
raise RuntimeError("SeekDB embedding function initialization failed")
|
||||
|
||||
return self._embedding_function(input_text)
|
||||
except Exception as e:
|
||||
from .. import errors
|
||||
raise errors.RequesterError(f'SeekDB embedding failed: {str(e)}')
|
||||
@@ -0,0 +1,21 @@
|
||||
apiVersion: v1
|
||||
kind: LLMAPIRequester
|
||||
metadata:
|
||||
name: seekdb-embedding
|
||||
label:
|
||||
en_US: SeekDB Embedding
|
||||
zh_Hans: SeekDB 嵌入
|
||||
description:
|
||||
en_US: SeekDB Python library built-in embedding model (all-MiniLM-L6-v2), it will take time to download the model file for the first time
|
||||
zh_Hans: 使用来自 SeekDB Python 库的内置嵌入模型 (all-MiniLM-L6-v2),首次使用时将会花费时间自动下载模型文件
|
||||
ja_JP: SeekDB Python ライブラリの組み込み埋め込みモデル (all-MiniLM-L6-v2) を使用します。初回使用時にモデルファイルのダウンロードに時間がかかります。
|
||||
icon: seekdb.svg
|
||||
spec:
|
||||
config: []
|
||||
support_type:
|
||||
- text-embedding
|
||||
provider_category: builtin
|
||||
execution:
|
||||
python:
|
||||
path: ./seekdbembed.py
|
||||
attr: SeekDBEmbedding
|
||||
@@ -4,6 +4,7 @@ from ..core import app
|
||||
from .vdb import VectorDatabase
|
||||
from .vdbs.chroma import ChromaVectorDatabase
|
||||
from .vdbs.qdrant import QdrantVectorDatabase
|
||||
from .vdbs.seekdb import SeekDBVectorDatabase
|
||||
from .vdbs.milvus import MilvusVectorDatabase
|
||||
from .vdbs.pgvector_db import PgVectorDatabase
|
||||
|
||||
@@ -27,6 +28,9 @@ class VectorDBManager:
|
||||
elif vdb_type == 'qdrant':
|
||||
self.vector_db = QdrantVectorDatabase(self.ap)
|
||||
self.ap.logger.info('Initialized Qdrant vector database backend.')
|
||||
elif vdb_type == 'seekdb':
|
||||
self.vector_db = SeekDBVectorDatabase(self.ap)
|
||||
self.ap.logger.info('Initialized SeekDB vector database backend.')
|
||||
|
||||
elif vdb_type == 'milvus':
|
||||
# Get Milvus configuration
|
||||
|
||||
@@ -0,0 +1,7 @@
|
||||
"""Vector database implementations for LangBot."""
|
||||
|
||||
from .chroma import ChromaVectorDatabase
|
||||
from .qdrant import QdrantVectorDatabase
|
||||
from .seekdb import SeekDBVectorDatabase
|
||||
|
||||
__all__ = ['ChromaVectorDatabase', 'QdrantVectorDatabase', 'SeekDBVectorDatabase']
|
||||
|
||||
252
src/langbot/pkg/vector/vdbs/seekdb.py
Normal file
252
src/langbot/pkg/vector/vdbs/seekdb.py
Normal file
@@ -0,0 +1,252 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
from typing import Any, Dict, List
|
||||
|
||||
import sqlalchemy
|
||||
|
||||
from langbot.pkg.core import app
|
||||
from langbot.pkg.entity.persistence import model as persistence_model
|
||||
from langbot.pkg.vector.vdb import VectorDatabase
|
||||
|
||||
try:
|
||||
import pyseekdb
|
||||
from pyseekdb import HNSWConfiguration
|
||||
|
||||
SEEKDB_AVAILABLE = True
|
||||
except ImportError:
|
||||
SEEKDB_AVAILABLE = False
|
||||
|
||||
SEEKDB_EMBEDDING_MODEL_UUID = 'seekdb-builtin-embedding'
|
||||
SEEKDB_EMBEDDING_REQUESTER = 'seekdb-embedding'
|
||||
|
||||
|
||||
class SeekDBVectorDatabase(VectorDatabase):
|
||||
"""SeekDB vector database adapter for LangBot.
|
||||
|
||||
SeekDB is an AI-native search database by OceanBase that unifies
|
||||
relational, vector, text, JSON and GIS in a single engine.
|
||||
|
||||
Supports both embedded mode and remote server mode.
|
||||
"""
|
||||
|
||||
def __init__(self, ap: app.Application):
|
||||
if not SEEKDB_AVAILABLE:
|
||||
raise ImportError('pyseekdb is not installed. Install it with: pip install pyseekdb')
|
||||
|
||||
self.ap = ap
|
||||
config = self.ap.instance_config.data['vdb']['seekdb']
|
||||
|
||||
# Determine connection mode based on config
|
||||
mode = config.get('mode', 'embedded') # 'embedded' or 'server'
|
||||
|
||||
if mode == 'embedded':
|
||||
# Embedded mode: local database
|
||||
path = config.get('path', './data/seekdb')
|
||||
database = config.get('database', 'langbot')
|
||||
|
||||
# Use AdminClient for database management operations
|
||||
admin_client = pyseekdb.AdminClient(path=path)
|
||||
# Check if database exists using public API
|
||||
existing_dbs = [db.name for db in admin_client.list_databases()]
|
||||
if database not in existing_dbs:
|
||||
# Use public API to create database
|
||||
admin_client.create_database(database)
|
||||
self.ap.logger.info(f"Created SeekDB database '{database}'")
|
||||
|
||||
self.client = pyseekdb.Client(path=path, database=database)
|
||||
self.ap.logger.info(f"Initialized SeekDB in embedded mode at '{path}', database '{database}'")
|
||||
elif mode == 'server':
|
||||
# Server mode: remote SeekDB or OceanBase server
|
||||
host = config.get('host', 'localhost')
|
||||
port = config.get('port', 2881)
|
||||
database = config.get('database', 'langbot')
|
||||
user = config.get('user', 'root')
|
||||
password = config.get('password', '')
|
||||
tenant = config.get('tenant', None) # Optional, for OceanBase
|
||||
|
||||
connection_params = {
|
||||
'host': host,
|
||||
'port': int(port),
|
||||
'database': database,
|
||||
'user': user,
|
||||
'password': password,
|
||||
}
|
||||
|
||||
if tenant:
|
||||
connection_params['tenant'] = tenant
|
||||
|
||||
self.client = pyseekdb.Client(**connection_params)
|
||||
self.ap.logger.info(
|
||||
f"Initialized SeekDB in server mode: {host}:{port}, database '{database}'"
|
||||
+ (f", tenant '{tenant}'" if tenant else '')
|
||||
)
|
||||
else:
|
||||
raise ValueError(f"Invalid SeekDB mode: {mode}. Must be 'embedded' or 'server'")
|
||||
|
||||
self._collections: Dict[str, Any] = {}
|
||||
self._collection_configs: Dict[str, HNSWConfiguration] = {}
|
||||
|
||||
self._escape_table = str.maketrans({
|
||||
'\x00': '',
|
||||
'\\': '\\\\',
|
||||
'"': '\\"',
|
||||
'\n': '\\n',
|
||||
'\r': '\\r',
|
||||
'\t': '\\t',
|
||||
})
|
||||
|
||||
async def _get_or_create_collection_internal(self, collection: str, vector_size: int = None) -> Any:
|
||||
"""Internal method to get or create a collection with proper configuration."""
|
||||
if collection in self._collections:
|
||||
return self._collections[collection]
|
||||
|
||||
# Check if collection exists
|
||||
if await asyncio.to_thread(self.client.has_collection, collection):
|
||||
# Collection exists, get it
|
||||
coll = await asyncio.to_thread(self.client.get_collection, collection, embedding_function=None)
|
||||
self._collections[collection] = coll
|
||||
self.ap.logger.info(f"SeekDB collection '{collection}' retrieved.")
|
||||
return coll
|
||||
|
||||
# Collection doesn't exist, create it
|
||||
if vector_size is None:
|
||||
# Default dimension if not specified
|
||||
vector_size = 384
|
||||
|
||||
# Create HNSW configuration
|
||||
config = HNSWConfiguration(dimension=vector_size, distance='cosine')
|
||||
self._collection_configs[collection] = config
|
||||
|
||||
# Create collection without embedding function (we manage embeddings externally)
|
||||
coll = await asyncio.to_thread(
|
||||
self.client.create_collection,
|
||||
name=collection,
|
||||
configuration=config,
|
||||
embedding_function=None, # Disable automatic embedding
|
||||
)
|
||||
|
||||
self._collections[collection] = coll
|
||||
self.ap.logger.info(f"SeekDB collection '{collection}' created with dimension={vector_size}, distance='cosine'")
|
||||
return coll
|
||||
|
||||
def _clean_metadata(self, meta: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""SeekDB metadata doesn't support \\ and ", insert will error 3104"""
|
||||
return {
|
||||
k: v.translate(self._escape_table) if isinstance(v, str)
|
||||
else v if v is None or isinstance(v, (int, float, bool))
|
||||
else str(v)
|
||||
for k, v in meta.items()
|
||||
if v is not None
|
||||
}
|
||||
|
||||
async def get_or_create_collection(self, collection: str):
|
||||
"""Get or create collection (without vector size - will use default)."""
|
||||
return await self._get_or_create_collection_internal(collection)
|
||||
|
||||
async def add_embeddings(
|
||||
self,
|
||||
collection: str,
|
||||
ids: List[str],
|
||||
embeddings_list: List[List[float]],
|
||||
metadatas: List[Dict[str, Any]]
|
||||
) -> None:
|
||||
"""Add vector embeddings to the specified collection.
|
||||
|
||||
Args:
|
||||
collection: Collection name
|
||||
ids: List of document IDs
|
||||
embeddings_list: List of embedding vectors
|
||||
metadatas: List of metadata dictionaries
|
||||
"""
|
||||
if not embeddings_list:
|
||||
return
|
||||
|
||||
# Ensure collection exists with correct dimension
|
||||
vector_size = len(embeddings_list[0])
|
||||
coll = await self._get_or_create_collection_internal(collection, vector_size)
|
||||
|
||||
cleaned_metadatas = [self._clean_metadata(meta) for meta in metadatas]
|
||||
|
||||
await asyncio.to_thread(coll.add, ids=ids, embeddings=embeddings_list, metadatas=cleaned_metadatas)
|
||||
|
||||
self.ap.logger.info(f"Added {len(ids)} embeddings to SeekDB collection '{collection}'")
|
||||
|
||||
async def search(self, collection: str, query_embedding: List[float], k: int = 5) -> Dict[str, Any]:
|
||||
"""Search for the most similar vectors in the specified collection.
|
||||
|
||||
Args:
|
||||
collection: Collection name
|
||||
query_embedding: Query vector
|
||||
k: Number of results to return
|
||||
|
||||
Returns:
|
||||
Dictionary with 'ids', 'metadatas', 'distances' keys
|
||||
"""
|
||||
# Check if collection exists
|
||||
exists = await asyncio.to_thread(self.client.has_collection, collection)
|
||||
if not exists:
|
||||
return {'ids': [[]], 'metadatas': [[]], 'distances': [[]]}
|
||||
|
||||
# Get collection
|
||||
if collection not in self._collections:
|
||||
coll = await asyncio.to_thread(self.client.get_collection, collection, embedding_function=None)
|
||||
self._collections[collection] = coll
|
||||
else:
|
||||
coll = self._collections[collection]
|
||||
|
||||
# Perform query
|
||||
# SeekDB's query() returns: {'ids': [[...]], 'metadatas': [[...]], 'distances': [[...]]}
|
||||
results = await asyncio.to_thread(coll.query, query_embeddings=query_embedding, n_results=k)
|
||||
|
||||
self.ap.logger.info(f"SeekDB search in '{collection}' returned {len(results.get('ids', [[]])[0])} results")
|
||||
|
||||
return results
|
||||
|
||||
async def delete_by_file_id(self, collection: str, file_id: str) -> None:
|
||||
"""Delete vectors from the collection by file_id metadata.
|
||||
|
||||
Args:
|
||||
collection: Collection name
|
||||
file_id: File ID to delete
|
||||
"""
|
||||
# Check if collection exists
|
||||
exists = await asyncio.to_thread(self.client.has_collection, collection)
|
||||
if not exists:
|
||||
self.ap.logger.warning(f"SeekDB collection '{collection}' not found for deletion")
|
||||
return
|
||||
|
||||
# Get collection
|
||||
if collection not in self._collections:
|
||||
coll = await asyncio.to_thread(self.client.get_collection, collection, embedding_function=None)
|
||||
self._collections[collection] = coll
|
||||
else:
|
||||
coll = self._collections[collection]
|
||||
|
||||
# SeekDB's delete() expects a where clause for filtering
|
||||
# Delete all records where metadata['file_id'] == file_id
|
||||
await asyncio.to_thread(coll.delete, where={'file_id': file_id})
|
||||
|
||||
self.ap.logger.info(f"Deleted embeddings from SeekDB collection '{collection}' with file_id: {file_id}")
|
||||
|
||||
async def delete_collection(self, collection: str):
|
||||
"""Delete the entire collection.
|
||||
|
||||
Args:
|
||||
collection: Collection name
|
||||
"""
|
||||
# Remove from cache
|
||||
if collection in self._collections:
|
||||
del self._collections[collection]
|
||||
if collection in self._collection_configs:
|
||||
del self._collection_configs[collection]
|
||||
|
||||
# Check if collection exists
|
||||
exists = await asyncio.to_thread(self.client.has_collection, collection)
|
||||
if not exists:
|
||||
self.ap.logger.warning(f"SeekDB collection '{collection}' not found for deletion")
|
||||
return
|
||||
|
||||
# Delete collection
|
||||
await asyncio.to_thread(self.client.delete_collection, collection)
|
||||
self.ap.logger.info(f"SeekDB collection '{collection}' deleted")
|
||||
@@ -37,6 +37,17 @@ vdb:
|
||||
host: localhost
|
||||
port: 6333
|
||||
api_key: ''
|
||||
seekdb:
|
||||
mode: embedded # 'embedded' or 'server'
|
||||
# Embedded mode options:
|
||||
path: './data/seekdb'
|
||||
database: 'langbot'
|
||||
# Server mode options (used when mode='server'):
|
||||
host: 'localhost'
|
||||
port: 2881
|
||||
user: 'root'
|
||||
password: ''
|
||||
tenant: '' # Optional, for OceanBase server
|
||||
milvus:
|
||||
uri: 'http://127.0.0.1:19530'
|
||||
token: ''
|
||||
|
||||
Reference in New Issue
Block a user