mirror of https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web.git
synced 2025-11-13 04:33:42 +08:00
Improve mistral tool use functionality
This commit is contained in:
docs/bedrock-response-format.md (new file, 258 lines)
@@ -0,0 +1,258 @@
# Understanding Bedrock Response Format

The AWS Bedrock streaming response format consists of multiple Server-Sent Events (SSE) chunks. Each chunk follows this structure:

```
:event-type chunk
:content-type application/json
:message-type event

{"bytes":"base64_encoded_data","p":"signature"}
```
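For orientation, here is a minimal TypeScript sketch of decoding one such chunk's `bytes` payload. The function name is illustrative only and is not the project's actual helper:

```ts
// Minimal sketch: decode the base64 "bytes" field of a Bedrock SSE chunk
// into the model-specific JSON event described in the sections below.
// Function name is illustrative, not the project's actual API.
function decodeBedrockChunk(chunk: { bytes: string; p?: string }): unknown {
  const json = Buffer.from(chunk.bytes, "base64").toString("utf-8");
  return JSON.parse(json);
}
```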
## Model-Specific Response Formats

### Claude 3 Format

When using Claude 3 models (e.g., claude-3-haiku-20240307), the decoded messages include:

1. **message_start**
   ```json
   {
     "type": "message_start",
     "message": {
       "id": "msg_bdrk_01A6sahWac4XVTR9sX3rgvsZ",
       "type": "message",
       "role": "assistant",
       "model": "claude-3-haiku-20240307",
       "content": [],
       "stop_reason": null,
       "stop_sequence": null,
       "usage": {
         "input_tokens": 8,
         "output_tokens": 1
       }
     }
   }
   ```

2. **content_block_start**
   ```json
   {
     "type": "content_block_start",
     "index": 0,
     "content_block": {
       "type": "text",
       "text": ""
     }
   }
   ```

3. **content_block_delta**
   ```json
   {
     "type": "content_block_delta",
     "index": 0,
     "delta": {
       "type": "text_delta",
       "text": "Hello"
     }
   }
   ```
### Mistral Format

When using Mistral models (e.g., mistral-large-2407), the decoded messages have a different structure:

```json
{
  "id": "b0098812-0ad9-42da-9f17-a5e2f554eb6b",
  "object": "chat.completion.chunk",
  "created": 1732582566,
  "model": "mistral-large-2407",
  "choices": [{
    "index": 0,
    "logprobs": null,
    "context_logits": null,
    "generation_logits": null,
    "message": {
      "role": null,
      "content": "Hello",
      "tool_calls": null,
      "index": null,
      "tool_call_id": null
    },
    "stop_reason": null
  }],
  "usage": null,
  "p": null
}
```
### Llama Format

When using Llama models (3.1 or 3.2), the decoded messages use a simpler structure focused on generation tokens:

```json
{
  "generation": "Hello",
  "prompt_token_count": null,
  "generation_token_count": 2,
  "stop_reason": null
}
```

Each chunk contains:

- generation: The generated text piece
- prompt_token_count: Token count of the input (only present in first chunk)
- generation_token_count: Running count of generated tokens
- stop_reason: Indicates completion (null until final chunk)

First chunk example (includes prompt_token_count):

```json
{
  "generation": "\n\n",
  "prompt_token_count": 10,
  "generation_token_count": 1,
  "stop_reason": null
}
```
### Titan Text Format

When using Amazon's Titan models (text or TG1), the response comes as a single chunk with complete text and metrics:

```json
{
  "outputText": "\nBot: Hello! How can I help you today?",
  "index": 0,
  "totalOutputTextTokenCount": 13,
  "completionReason": "FINISH",
  "inputTextTokenCount": 3,
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 3,
    "outputTokenCount": 13,
    "invocationLatency": 833,
    "firstByteLatency": 833
  }
}
```

Both Titan text and Titan TG1 use the same response format, with only minor differences in token counts and latency values. For example, here's a TG1 response:

```json
{
  "outputText": "\nBot: Hello! How can I help you?",
  "index": 0,
  "totalOutputTextTokenCount": 12,
  "completionReason": "FINISH",
  "inputTextTokenCount": 3,
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 3,
    "outputTokenCount": 12,
    "invocationLatency": 845,
    "firstByteLatency": 845
  }
}
```

Key fields:

- outputText: The complete generated response
- totalOutputTextTokenCount: Total tokens in the response
- completionReason: Reason for completion (e.g., "FINISH")
- inputTextTokenCount: Number of input tokens
- amazon-bedrock-invocationMetrics: Detailed performance metrics
## Model-Specific Completion Metrics

### Mistral

```json
{
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 29,
    "completion_tokens": 24
  },
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 5,
    "outputTokenCount": 24,
    "invocationLatency": 719,
    "firstByteLatency": 148
  }
}
```

### Claude 3

Included in the message_delta event along with stop_reason.

### Llama

Included in the final chunk with stop_reason "stop":

```json
{
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 10,
    "outputTokenCount": 11,
    "invocationLatency": 873,
    "firstByteLatency": 550
  }
}
```

### Titan

Both Titan text and TG1 include metrics in the single response chunk:

```json
{
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 3,
    "outputTokenCount": 12,
    "invocationLatency": 845,
    "firstByteLatency": 845
  }
}
```
## How the Response is Processed

1. The raw response is first split into chunks based on the SSE format
2. For each chunk:
   - The base64 encoded data is decoded
   - The JSON is parsed to extract the message content
   - Based on the model type and message type, different processing is applied:
### Claude 3 Processing

- message_start: Initializes a new message with model info and usage stats
- content_block_start: Starts a new content block (text, tool use, etc.)
- content_block_delta: Adds incremental content to the current block
- message_delta: Updates message metadata
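As a rough illustration (not the project's actual code), accumulating the streamed text from these events could look like the following TypeScript sketch; the event shape follows the Claude 3 examples earlier in this document, and the function name is made up for illustration:

```ts
// Illustrative sketch only: accumulate streamed Claude text deltas.
// Event shape follows the Claude 3 examples shown earlier in this document.
interface ClaudeEvent {
  type: string;
  delta?: { type: string; text?: string };
}

function appendClaudeEvent(buffer: string, event: ClaudeEvent): string {
  if (event.type === "content_block_delta" && event.delta?.type === "text_delta") {
    return buffer + (event.delta.text ?? "");
  }
  // message_start, content_block_start, and message_delta carry metadata only.
  return buffer;
}
```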
### Mistral Processing

- Each chunk contains a complete message object with a choices array
- The content is streamed through the message.content field
- The final chunk includes token usage and invocation metrics
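A minimal sketch of pulling the streamed text out of a Mistral chunk, assuming the shape shown in the Mistral example above (the interface and function names are illustrative):

```ts
// Illustrative sketch only: extract streamed text from a Mistral-style chunk.
interface MistralChunk {
  choices?: Array<{ message?: { content?: string | null } }>;
}

function extractMistralText(chunk: MistralChunk): string {
  return chunk.choices?.[0]?.message?.content ?? "";
}
```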
### Llama Processing

- Each chunk contains a generation field with the text piece
- First chunk includes prompt_token_count
- Tracks generation progress through generation_token_count
- Simple streaming format focused on text generation
- Final chunk includes complete metrics
### Titan Processing

- Single chunk response with complete text
- No streaming - returns full response at once
- Includes comprehensive metrics in the same chunk
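Both the Llama and Titan formats reduce to a single field read. A hedged TypeScript sketch, with field names taken from the examples above and function names that are illustrative only:

```ts
// Illustrative sketch only: Llama and Titan text extraction is a single field read.
function extractLlamaText(chunk: { generation?: string }): string {
  return chunk.generation ?? "";
}

function extractTitanText(chunk: { outputText?: string }): string {
  return chunk.outputText ?? "";
}
```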
## Handling in Code

The response is processed by the `transformBedrockStream` function in `app/utils/aws.ts`, which:

1. Reads the stream chunks
2. Parses each chunk using `parseEventData`
3. Handles model-specific formats:
   - For Claude: Processes message_start, content_block_start, content_block_delta
   - For Mistral: Extracts content from choices[0].message.content
   - For Llama: Uses the generation field directly
   - For Titan: Uses the outputText field from the single response
4. Transforms the parsed data into a consistent format for the client
5. Yields the transformed data as SSE events

This allows for real-time streaming of the model's response while maintaining a consistent format for the client application, regardless of which model is being used.
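To illustrate the overall shape of such a transform, here is a simplified TypeScript sketch. It is not the actual `transformBedrockStream` implementation, and the output payload shape is an assumption chosen for illustration; the per-model extractor is supplied by the caller, like the ones sketched in the processing sections above:

```ts
// Simplified sketch of the streaming transform described above; not the
// project's actual transformBedrockStream implementation. The caller supplies
// a per-model text extractor such as the ones sketched in earlier sections.
async function* toClientSse(
  events: AsyncIterable<unknown>,
  extractText: (event: unknown) => string,
): AsyncGenerator<string> {
  for await (const event of events) {
    const text = extractText(event);
    if (text) {
      // Re-emit every model's output in one consistent SSE shape for the client.
      // The { delta } payload is illustrative, not the project's wire format.
      yield `data: ${JSON.stringify({ delta: text })}\n\n`;
    }
  }
}
```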