feat: implement summarization (#14)

Commit f0a2381bd5 (parent 1352b0e0ba)
Author: DanielWalnut
Date: 2026-01-19 16:17:31 +08:00
Committed by: GitHub
8 changed files with 555 additions and 5 deletions

View File

@@ -81,14 +81,27 @@ Config values starting with `$` are resolved as environment variables (e.g., `$O
- Local sandbox: `/path/to/deer-flow/skills` mounted at `/mnt/skills`
- Docker sandbox: Automatically mounted as volume
**Middleware System**
- Custom middlewares in `src/agents/middlewares/`: Title generation, thread data, clarification, etc.
- `SummarizationMiddleware` from LangChain automatically condenses conversation history when token limits are approached
- Configured in `config.yaml` under `summarization` key with trigger/keep thresholds
- Middlewares are registered in `src/agents/lead_agent/agent.py` with execution order:
1. `ThreadDataMiddleware` - Initializes thread context
2. `SandboxMiddleware` - Manages sandbox lifecycle
3. `SummarizationMiddleware` - Reduces context when limits are approached (if enabled)
4. `TitleMiddleware` - Generates conversation titles
5. `ClarificationMiddleware` - Handles clarification requests (must be last)
### Config Schema
Models, tools, sandbox providers, skills, and middleware settings are configured in `config.yaml`:
- `models[]`: LLM configurations with `use` class path
- `tools[]`: Tool configurations with `use` variable path and `group`
- `sandbox.use`: Sandbox provider class path
- `skills.path`: Host path to skills directory (optional, default: `../skills`)
- `skills.container_path`: Container mount path (default: `/mnt/skills`)
- `title`: Automatic thread title generation configuration
- `summarization`: Automatic conversation summarization configuration
## Code Style

View File

@@ -4,7 +4,9 @@
[x] Launch the sandbox only after the first file system or bash tool is called
[ ] Pooling the sandbox resources to reduce the number of sandbox containers
[x] Add Clarification Process for the whole process
[x] Implement Context Summarization Mechanism to avoid context explosion
[ ] Integrate MCP
## Issues

View File

@@ -0,0 +1,353 @@
# Conversation Summarization
DeerFlow includes automatic conversation summarization to handle long conversations that approach model token limits. When enabled, the system automatically condenses older messages while preserving recent context.
## Overview
The summarization feature uses LangChain's `SummarizationMiddleware` to monitor conversation history and trigger summarization based on configurable thresholds. When activated, it:
1. Monitors message token counts in real-time
2. Triggers summarization when thresholds are met
3. Keeps recent messages intact while summarizing older exchanges
4. Maintains AI/Tool message pairs together for context continuity
5. Injects the summary back into the conversation
## Configuration
Summarization is configured in `config.yaml` under the `summarization` key:
```yaml
summarization:
  enabled: true
  model_name: null # Use default model or specify a lightweight model

  # Trigger conditions (OR logic - any condition triggers summarization)
  trigger:
    - type: tokens
      value: 4000
    # Additional triggers (optional)
    # - type: messages
    #   value: 50
    # - type: fraction
    #   value: 0.8 # 80% of model's max input tokens

  # Context retention policy
  keep:
    type: messages
    value: 20

  # Token trimming for summarization call
  trim_tokens_to_summarize: 4000

  # Custom summary prompt (optional)
  summary_prompt: null
```
### Configuration Options
#### `enabled`
- **Type**: Boolean
- **Default**: `false`
- **Description**: Enable or disable automatic summarization
#### `model_name`
- **Type**: String or null
- **Default**: `null` (uses default model)
- **Description**: Model to use for generating summaries. Recommended to use a lightweight, cost-effective model like `gpt-4o-mini` or equivalent.
#### `trigger`
- **Type**: Single `ContextSize` or list of `ContextSize` objects
- **Required**: At least one trigger must be specified when enabled
- **Description**: Thresholds that trigger summarization. Uses OR logic - summarization runs when ANY threshold is met.
**ContextSize Types:**
1. **Token-based trigger**: Activates when token count reaches the specified value
```yaml
trigger:
  type: tokens
  value: 4000
```
2. **Message-based trigger**: Activates when message count reaches the specified value
```yaml
trigger:
  type: messages
  value: 50
```
3. **Fraction-based trigger**: Activates when token usage reaches a percentage of the model's maximum input tokens
```yaml
trigger:
  type: fraction
  value: 0.8 # 80% of max input tokens
```
**Multiple Triggers:**
```yaml
trigger:
  - type: tokens
    value: 4000
  - type: messages
    value: 50
```
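The OR semantics above can be sketched as a small check. This is illustrative only; `check_triggers` and its signature are not part of the actual middleware API, and the counts would come from the live message history:

```python
def check_triggers(triggers, token_count, message_count, max_input_tokens):
    """Return True when ANY (type, value) threshold is met (OR logic)."""
    for kind, value in triggers:
        if kind == "tokens" and token_count >= value:
            return True
        if kind == "messages" and message_count >= value:
            return True
        # fraction: compare token usage against a share of the model's capacity
        if kind == "fraction" and token_count >= value * max_input_tokens:
            return True
    return False
```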
#### `keep`
- **Type**: `ContextSize` object
- **Default**: `{type: messages, value: 20}`
- **Description**: Specifies how much recent conversation history to preserve after summarization.
**Examples:**
```yaml
# Keep most recent 20 messages
keep:
  type: messages
  value: 20

# Keep most recent 3000 tokens
keep:
  type: tokens
  value: 3000

# Keep most recent 30% of model's max input tokens
keep:
  type: fraction
  value: 0.3
```
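One way to read the `keep` policy is as a resolver that decides how many trailing messages survive. This is a sketch, not the middleware's internal code; `count_tokens` here is a stand-in counter:

```python
def messages_to_keep(messages, keep, max_input_tokens, count_tokens=len):
    """Resolve a (type, value) keep policy into a count of trailing messages."""
    kind, value = keep
    if kind == "messages":
        return min(int(value), len(messages))
    # tokens / fraction: walk backwards until the budget is exhausted
    budget = value * max_input_tokens if kind == "fraction" else value
    kept = total = 0
    for message in reversed(messages):
        total += count_tokens(message)
        if total > budget:
            break
        kept += 1
    return kept
```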
#### `trim_tokens_to_summarize`
- **Type**: Integer or null
- **Default**: `4000`
- **Description**: Maximum tokens to include when preparing messages for the summarization call itself. Set to `null` to skip trimming (not recommended for very long conversations).
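The trimming step can be pictured like this (a sketch, not the middleware's internal code; `count_tokens` is a placeholder counter):

```python
def trim_for_summary(messages, max_tokens, count_tokens=len):
    """Drop the oldest messages until what remains fits the summarization budget."""
    if max_tokens is None:
        return messages  # null config: skip trimming entirely
    total = sum(count_tokens(m) for m in messages)
    start = 0
    while total > max_tokens and start < len(messages):
        total -= count_tokens(messages[start])
        start += 1
    return messages[start:]
```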
#### `summary_prompt`
- **Type**: String or null
- **Default**: `null` (uses LangChain's default prompt)
- **Description**: Custom prompt template for generating summaries. The prompt should guide the model to extract the most important context.
**Default Prompt Behavior:**
The default LangChain prompt instructs the model to:
- Extract highest quality/most relevant context
- Focus on information critical to the overall goal
- Avoid repeating completed actions
- Return only the extracted context
## How It Works
### Summarization Flow
1. **Monitoring**: Before each model call, the middleware counts tokens in the message history
2. **Trigger Check**: If any configured threshold is met, summarization is triggered
3. **Message Partitioning**: Messages are split into:
- Messages to summarize (older messages beyond the `keep` threshold)
- Messages to preserve (recent messages within the `keep` threshold)
4. **Summary Generation**: The model generates a concise summary of the older messages
5. **Context Replacement**: The message history is updated:
- All old messages are removed
- A single summary message is added
- Recent messages are preserved
6. **AI/Tool Pair Protection**: The system ensures AI messages and their corresponding tool messages stay together
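The flow above, reduced to a minimal sketch over plain strings (the real middleware operates on LangChain message objects; `summarize` stands in for the model call):

```python
def apply_summarization(messages, keep_last, summarize):
    """Partition history, summarize the older part, rebuild the message list."""
    cut = max(len(messages) - keep_last, 0)
    older, recent = messages[:cut], messages[cut:]
    if not older:
        return messages  # nothing old enough to summarize
    # The summary replaces all older messages as a single message
    summary = "Here is a summary of the conversation to date:\n\n" + summarize(older)
    return [summary] + recent
```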
### Token Counting
- Uses approximate token counting based on character count
- For Anthropic models: ~3.3 characters per token
- For other models: Uses LangChain's default estimation
- Can be customized with a custom `token_counter` function
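A character-based estimate like the one described is trivial to sketch (the 3.3 ratio is the Anthropic approximation mentioned above; treat this as an illustration, not the library's exact counter):

```python
def approx_token_count(text, chars_per_token=3.3):
    """Rough token estimate from character count; cheap but imprecise."""
    return round(len(text) / chars_per_token)
```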
### Message Preservation
The middleware intelligently preserves message context:
- **Recent Messages**: Always kept intact based on `keep` configuration
- **AI/Tool Pairs**: Never split - if a cutoff point falls within tool messages, the system adjusts to keep the entire AI + Tool message sequence together
- **Summary Format**: Summary is injected as a HumanMessage with the format:
```
Here is a summary of the conversation to date:
[Generated summary text]
```
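The pair-protection adjustment can be sketched as follows (illustrative; roles are simplified to strings, and the helper name is hypothetical):

```python
def adjust_cutoff(messages, cutoff):
    """Move the cutoff back so tool results are never split from their AI message."""
    while 0 < cutoff < len(messages) and messages[cutoff]["role"] == "tool":
        cutoff -= 1  # pull the preceding AI (tool-calling) message into the kept region
    return cutoff
```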
## Best Practices
### Choosing Trigger Thresholds
1. **Token-based triggers**: Recommended for most use cases
- Set to 60-80% of your model's context window
- Example: For 8K context, use 4000-6000 tokens
2. **Message-based triggers**: Useful for controlling conversation length
- Good for applications with many short messages
- Example: 50-100 messages depending on average message length
3. **Fraction-based triggers**: Ideal when using multiple models
- Automatically adapts to each model's capacity
- Example: 0.8 (80% of model's max input tokens)
### Choosing Retention Policy (`keep`)
1. **Message-based retention**: Best for most scenarios
- Preserves natural conversation flow
- Recommended: 15-25 messages
2. **Token-based retention**: Use when precise control is needed
- Good for managing exact token budgets
- Recommended: 2000-4000 tokens
3. **Fraction-based retention**: For multi-model setups
- Automatically scales with model capacity
- Recommended: 0.2-0.4 (20-40% of max input)
### Model Selection
- **Recommended**: Use a lightweight, cost-effective model for summaries
- Examples: `gpt-4o-mini`, `claude-haiku`, or equivalent
- Summaries don't require the most powerful models
- Significant cost savings on high-volume applications
- **Default**: If `model_name` is `null`, uses the default model
- May be more expensive but ensures consistency
- Good for simple setups
### Optimization Tips
1. **Balance triggers**: Combine token and message triggers for robust handling
```yaml
trigger:
  - type: tokens
    value: 4000
  - type: messages
    value: 50
```
2. **Conservative retention**: Keep more messages initially, adjust based on performance
```yaml
keep:
  type: messages
  value: 25 # Start higher, reduce if needed
```
3. **Trim strategically**: Limit tokens sent to summarization model
```yaml
trim_tokens_to_summarize: 4000 # Prevents expensive summarization calls
```
4. **Monitor and iterate**: Track summary quality and adjust configuration
## Troubleshooting
### Summary Quality Issues
**Problem**: Summaries losing important context
**Solutions**:
1. Increase `keep` value to preserve more messages
2. Decrease trigger thresholds to summarize earlier
3. Customize `summary_prompt` to emphasize key information
4. Use a more capable model for summarization
### Performance Issues
**Problem**: Summarization calls taking too long
**Solutions**:
1. Use a faster model for summaries (e.g., `gpt-4o-mini`)
2. Reduce `trim_tokens_to_summarize` to send less context
3. Increase trigger thresholds to summarize less frequently
### Token Limit Errors
**Problem**: Still hitting token limits despite summarization
**Solutions**:
1. Lower trigger thresholds to summarize earlier
2. Reduce `keep` value to preserve fewer messages
3. Check if individual messages are very large
4. Consider using fraction-based triggers
## Implementation Details
### Code Structure
- **Configuration**: `src/config/summarization_config.py`
- **Integration**: `src/agents/lead_agent/agent.py`
- **Middleware**: Uses `langchain.agents.middleware.SummarizationMiddleware`
### Middleware Order
Summarization runs after ThreadData and Sandbox initialization but before Title and Clarification:
1. ThreadDataMiddleware
2. SandboxMiddleware
3. **SummarizationMiddleware** ← Runs here
4. TitleMiddleware
5. ClarificationMiddleware
### State Management
- Summarization is stateless - configuration is loaded once at startup
- Summaries are added as regular messages in the conversation history
- The checkpointer persists the summarized history automatically
## Example Configurations
### Minimal Configuration
```yaml
summarization:
  enabled: true
  trigger:
    type: tokens
    value: 4000
  keep:
    type: messages
    value: 20
```
### Production Configuration
```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini # Lightweight model for cost efficiency
  trigger:
    - type: tokens
      value: 6000
    - type: messages
      value: 75
  keep:
    type: messages
    value: 25
  trim_tokens_to_summarize: 5000
```
### Multi-Model Configuration
```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini
  trigger:
    type: fraction
    value: 0.7 # 70% of model's max input
  keep:
    type: fraction
    value: 0.3 # Keep 30% of max input
  trim_tokens_to_summarize: 4000
```
### Conservative Configuration (High Quality)
```yaml
summarization:
  enabled: true
  model_name: gpt-4 # Use full model for high-quality summaries
  trigger:
    type: tokens
    value: 8000
  keep:
    type: messages
    value: 40 # Keep more context
  trim_tokens_to_summarize: null # No trimming
```
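A minimal sanity check for a parsed `summarization` block might look like this (a hypothetical stdlib-only helper for illustration — the project itself validates the config with Pydantic models):

```python
VALID_TYPES = {"tokens", "messages", "fraction"}

def validate_trigger(trigger):
    """Normalize a single {type, value} mapping or a list of them; raise on bad input."""
    specs = trigger if isinstance(trigger, list) else [trigger]
    for spec in specs:
        if spec.get("type") not in VALID_TYPES:
            raise ValueError(f"unknown trigger type: {spec.get('type')!r}")
        if spec["type"] == "fraction" and not 0 < spec["value"] <= 1:
            raise ValueError("fraction value must be in (0, 1]")
    return specs
```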
## References
- [LangChain Summarization Middleware Documentation](https://docs.langchain.com/oss/python/langchain/middleware/built-in#summarization)
- [LangChain Source Code](https://github.com/langchain-ai/langchain)

View File

@@ -1,4 +1,5 @@
from langchain.agents import create_agent
from langchain.agents.middleware import SummarizationMiddleware
from langchain_core.runnables import RunnableConfig
from src.agents.lead_agent.prompt import apply_prompt_template
@@ -6,12 +7,66 @@ from src.agents.middlewares.clarification_middleware import ClarificationMiddlew
from src.agents.middlewares.thread_data_middleware import ThreadDataMiddleware
from src.agents.middlewares.title_middleware import TitleMiddleware
from src.agents.thread_state import ThreadState
from src.config.summarization_config import get_summarization_config
from src.models import create_chat_model
from src.sandbox.middleware import SandboxMiddleware
def _create_summarization_middleware() -> SummarizationMiddleware | None:
    """Create and configure the summarization middleware from config."""
    config = get_summarization_config()
    if not config.enabled:
        return None

    # Prepare trigger parameter
    trigger = None
    if config.trigger is not None:
        if isinstance(config.trigger, list):
            trigger = [t.to_tuple() for t in config.trigger]
        else:
            trigger = config.trigger.to_tuple()

    # Prepare keep parameter
    keep = config.keep.to_tuple()

    # Prepare model parameter
    if config.model_name:
        model = config.model_name
    else:
        # Use a lightweight model for summarization to save costs
        # Falls back to default model if not explicitly specified
        model = create_chat_model(thinking_enabled=False)

    # Prepare kwargs
    kwargs = {
        "model": model,
        "trigger": trigger,
        "keep": keep,
    }
    if config.trim_tokens_to_summarize is not None:
        kwargs["trim_tokens_to_summarize"] = config.trim_tokens_to_summarize
    if config.summary_prompt is not None:
        kwargs["summary_prompt"] = config.summary_prompt

    return SummarizationMiddleware(**kwargs)


# ThreadDataMiddleware must be before SandboxMiddleware to ensure thread_id is available
# SummarizationMiddleware should be early to reduce context before other processing
# ClarificationMiddleware should be last to intercept clarification requests after model calls
def _build_middlewares():
    middlewares = [ThreadDataMiddleware(), SandboxMiddleware()]

    # Add summarization middleware if enabled
    summarization_middleware = _create_summarization_middleware()
    if summarization_middleware is not None:
        middlewares.append(summarization_middleware)

    middlewares.extend([TitleMiddleware(), ClarificationMiddleware()])
    return middlewares
def make_lead_agent(config: RunnableConfig):
@@ -24,7 +79,7 @@ def make_lead_agent(config: RunnableConfig):
    return create_agent(
        model=create_chat_model(name=model_name, thinking_enabled=thinking_enabled),
        tools=get_available_tools(),
        middleware=_build_middlewares(),
        system_prompt=apply_prompt_template(),
        state_schema=ThreadState,
    )

View File

@@ -89,7 +89,7 @@ You: "Deploying to staging..." [proceed]
You have access to skills that provide optimized workflows for specific tasks. Each skill contains best practices, frameworks, and references to additional resources.
**Progressive Loading Pattern:**
1. When a user query matches a skill's use case, immediately call `read_file` on the skill's main file using the path attribute provided in the skill tag below
2. Read and understand the skill's workflow and instructions
3. The skill file contains references to external resources under the same folder
4. Load referenced resources only when needed during execution

View File

@@ -9,6 +9,7 @@ from pydantic import BaseModel, ConfigDict, Field
from src.config.model_config import ModelConfig
from src.config.sandbox_config import SandboxConfig
from src.config.skills_config import SkillsConfig
from src.config.summarization_config import load_summarization_config_from_dict
from src.config.title_config import load_title_config_from_dict
from src.config.tool_config import ToolConfig, ToolGroupConfig
@@ -75,6 +76,10 @@ class AppConfig(BaseModel):
        if "title" in config_data:
            load_title_config_from_dict(config_data["title"])

        # Load summarization config if present
        if "summarization" in config_data:
            load_summarization_config_from_dict(config_data["summarization"])

        result = cls.model_validate(config_data)
        return result

View File

@@ -0,0 +1,74 @@
"""Configuration for conversation summarization."""
from typing import Literal
from pydantic import BaseModel, Field
ContextSizeType = Literal["fraction", "tokens", "messages"]
class ContextSize(BaseModel):
"""Context size specification for trigger or keep parameters."""
type: ContextSizeType = Field(description="Type of context size specification")
value: int | float = Field(description="Value for the context size specification")
def to_tuple(self) -> tuple[ContextSizeType, int | float]:
"""Convert to tuple format expected by SummarizationMiddleware."""
return (self.type, self.value)
class SummarizationConfig(BaseModel):
"""Configuration for automatic conversation summarization."""
enabled: bool = Field(
default=False,
description="Whether to enable automatic conversation summarization",
)
model_name: str | None = Field(
default=None,
description="Model name to use for summarization (None = use a lightweight model)",
)
trigger: ContextSize | list[ContextSize] | None = Field(
default=None,
description="One or more thresholds that trigger summarization. When any threshold is met, summarization runs. "
"Examples: {'type': 'messages', 'value': 50} triggers at 50 messages, "
"{'type': 'tokens', 'value': 4000} triggers at 4000 tokens, "
"{'type': 'fraction', 'value': 0.8} triggers at 80% of model's max input tokens",
)
keep: ContextSize = Field(
default_factory=lambda: ContextSize(type="messages", value=20),
description="Context retention policy after summarization. Specifies how much history to preserve. "
"Examples: {'type': 'messages', 'value': 20} keeps 20 messages, "
"{'type': 'tokens', 'value': 3000} keeps 3000 tokens, "
"{'type': 'fraction', 'value': 0.3} keeps 30% of model's max input tokens",
)
trim_tokens_to_summarize: int | None = Field(
default=4000,
description="Maximum tokens to keep when preparing messages for summarization. Pass null to skip trimming.",
)
summary_prompt: str | None = Field(
default=None,
description="Custom prompt template for generating summaries. If not provided, uses the default LangChain prompt.",
)
# Global configuration instance
_summarization_config: SummarizationConfig = SummarizationConfig()
def get_summarization_config() -> SummarizationConfig:
"""Get the current summarization configuration."""
return _summarization_config
def set_summarization_config(config: SummarizationConfig) -> None:
"""Set the summarization configuration."""
global _summarization_config
_summarization_config = config
def load_summarization_config_from_dict(config_dict: dict) -> None:
"""Load summarization configuration from a dictionary."""
global _summarization_config
_summarization_config = SummarizationConfig(**config_dict)

View File

@@ -174,3 +174,51 @@ title:
  max_words: 6
  max_chars: 60
  model_name: null # Use default model (first model in models list)
# ============================================================================
# Summarization Configuration
# ============================================================================
# Automatically summarize conversation history when token limits are approached
# This helps maintain context in long conversations without exceeding model limits
summarization:
  enabled: true

  # Model to use for summarization (null = use default model)
  # Recommended: Use a lightweight, cost-effective model like "gpt-4o-mini" or similar
  model_name: null

  # Trigger conditions - at least one required
  # Summarization runs when ANY threshold is met (OR logic)
  # You can specify a single trigger or a list of triggers
  trigger:
    # Trigger when token count reaches 4000
    - type: tokens
      value: 4000
    # Uncomment to also trigger when message count reaches 50
    # - type: messages
    #   value: 50
    # Uncomment to trigger when 80% of model's max input tokens is reached
    # - type: fraction
    #   value: 0.8

  # Context retention policy after summarization
  # Specifies how much recent history to preserve
  keep:
    # Keep the most recent 20 messages (recommended)
    type: messages
    value: 20
    # Alternative: Keep specific token count
    # type: tokens
    # value: 3000
    # Alternative: Keep percentage of model's max input tokens
    # type: fraction
    # value: 0.3

  # Maximum tokens to keep when preparing messages for summarization
  # Set to null to skip trimming (not recommended for very long conversations)
  trim_tokens_to_summarize: 4000

  # Custom summary prompt template (null = use default LangChain prompt)
  # The prompt should guide the model to extract important context
  summary_prompt: null