# Conversation Summarization

DeerFlow includes automatic conversation summarization to handle long conversations that approach model token limits. When enabled, the system automatically condenses older messages while preserving recent context.

## Overview

The summarization feature uses LangChain's `SummarizationMiddleware` to monitor conversation history and trigger summarization based on configurable thresholds. When activated, it:

1. Monitors message token counts in real time
2. Triggers summarization when thresholds are met
3. Keeps recent messages intact while summarizing older exchanges
4. Maintains AI/Tool message pairs together for context continuity
5. Injects the summary back into the conversation

## Configuration

Summarization is configured in `config.yaml` under the `summarization` key:

```yaml
summarization:
  enabled: true
  model_name: null  # Use default model or specify a lightweight model

  # Trigger conditions (OR logic - any condition triggers summarization)
  trigger:
    - type: tokens
      value: 4000
    # Additional triggers (optional)
    # - type: messages
    #   value: 50
    # - type: fraction
    #   value: 0.8  # 80% of model's max input tokens

  # Context retention policy
  keep:
    type: messages
    value: 20

  # Token trimming for summarization call
  trim_tokens_to_summarize: 4000

  # Custom summary prompt (optional)
  summary_prompt: null
```

### Configuration Options

#### `enabled`

- **Type**: Boolean
- **Default**: `false`
- **Description**: Enable or disable automatic summarization

#### `model_name`

- **Type**: String or null
- **Default**: `null` (uses default model)
- **Description**: Model to use for generating summaries. It is recommended to use a lightweight, cost-effective model like `gpt-4o-mini` or equivalent.

#### `trigger`

- **Type**: Single `ContextSize` or list of `ContextSize` objects
- **Required**: At least one trigger must be specified when summarization is enabled
- **Description**: Thresholds that trigger summarization. Uses OR logic - summarization runs when ANY threshold is met.
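The OR logic can be sketched as a small Python helper. This is an illustrative stand-in, not DeerFlow's or LangChain's actual code: the `should_summarize` function, its parameters, and the dict-shaped triggers are assumptions made for the example.

```python
# Illustrative sketch of OR-logic trigger evaluation (not the real middleware code).
# A trigger fires when ANY configured ContextSize threshold is met.

def should_summarize(triggers, token_count, message_count, max_input_tokens):
    """Return True if any configured threshold is reached (OR logic)."""
    for trigger in triggers:
        kind, value = trigger["type"], trigger["value"]
        if kind == "tokens" and token_count >= value:
            return True
        if kind == "messages" and message_count >= value:
            return True
        # fraction: compare against a percentage of the model's max input tokens
        if kind == "fraction" and token_count >= value * max_input_tokens:
            return True
    return False

# Example: 4000-token OR 50-message triggers against an 8K-context model
triggers = [{"type": "tokens", "value": 4000}, {"type": "messages", "value": 50}]
print(should_summarize(triggers, token_count=4200, message_count=12, max_input_tokens=8192))  # True
print(should_summarize(triggers, token_count=1000, message_count=12, max_input_tokens=8192))  # False
```

Because the check is a simple OR over thresholds, adding more trigger entries only makes summarization fire earlier, never later.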
**ContextSize Types:**

1. **Token-based trigger**: Activates when the token count reaches the specified value

   ```yaml
   trigger:
     type: tokens
     value: 4000
   ```

2. **Message-based trigger**: Activates when the message count reaches the specified value

   ```yaml
   trigger:
     type: messages
     value: 50
   ```

3. **Fraction-based trigger**: Activates when token usage reaches a percentage of the model's maximum input tokens

   ```yaml
   trigger:
     type: fraction
     value: 0.8  # 80% of max input tokens
   ```

**Multiple Triggers:**

```yaml
trigger:
  - type: tokens
    value: 4000
  - type: messages
    value: 50
```

#### `keep`

- **Type**: `ContextSize` object
- **Default**: `{type: messages, value: 20}`
- **Description**: Specifies how much recent conversation history to preserve after summarization.

**Examples:**

```yaml
# Keep the most recent 20 messages
keep:
  type: messages
  value: 20

# Keep the most recent 3000 tokens
keep:
  type: tokens
  value: 3000

# Keep the most recent 30% of the model's max input tokens
keep:
  type: fraction
  value: 0.3
```

#### `trim_tokens_to_summarize`

- **Type**: Integer or null
- **Default**: `4000`
- **Description**: Maximum number of tokens to include when preparing messages for the summarization call itself. Set to `null` to skip trimming (not recommended for very long conversations).

#### `summary_prompt`

- **Type**: String or null
- **Default**: `null` (uses LangChain's default prompt)
- **Description**: Custom prompt template for generating summaries. The prompt should guide the model to extract the most important context.

**Default Prompt Behavior:**

The default LangChain prompt instructs the model to:

- Extract the highest-quality, most relevant context
- Focus on information critical to the overall goal
- Avoid repeating completed actions
- Return only the extracted context

## How It Works

### Summarization Flow

1. **Monitoring**: Before each model call, the middleware counts tokens in the message history
2. **Trigger Check**: If any configured threshold is met, summarization is triggered
3. **Message Partitioning**: Messages are split into:
   - Messages to summarize (older messages beyond the `keep` threshold)
   - Messages to preserve (recent messages within the `keep` threshold)
4. **Summary Generation**: The model generates a concise summary of the older messages
5. **Context Replacement**: The message history is updated:
   - All old messages are removed
   - A single summary message is added
   - Recent messages are preserved
6. **AI/Tool Pair Protection**: The system ensures AI messages and their corresponding tool messages stay together

### Token Counting

- Uses approximate token counting based on character count
- For Anthropic models: ~3.3 characters per token
- For other models: uses LangChain's default estimation
- Can be customized with a custom `token_counter` function

### Message Preservation

The middleware intelligently preserves message context:

- **Recent Messages**: Always kept intact based on the `keep` configuration
- **AI/Tool Pairs**: Never split - if a cutoff point falls within tool messages, the system adjusts to keep the entire AI + Tool message sequence together
- **Summary Format**: The summary is injected as a HumanMessage with the format:

  ```
  Here is a summary of the conversation to date:

  [Generated summary text]
  ```

## Best Practices

### Choosing Trigger Thresholds

1. **Token-based triggers**: Recommended for most use cases
   - Set to 60-80% of your model's context window
   - Example: for an 8K context, use 4000-6000 tokens
2. **Message-based triggers**: Useful for controlling conversation length
   - Good for applications with many short messages
   - Example: 50-100 messages depending on average message length
3. **Fraction-based triggers**: Ideal when using multiple models
   - Automatically adapts to each model's capacity
   - Example: 0.8 (80% of the model's max input tokens)

### Choosing Retention Policy (`keep`)

1. **Message-based retention**: Best for most scenarios
   - Preserves natural conversation flow
   - Recommended: 15-25 messages
2. **Token-based retention**: Use when precise control is needed
   - Good for managing exact token budgets
   - Recommended: 2000-4000 tokens
3. **Fraction-based retention**: For multi-model setups
   - Automatically scales with model capacity
   - Recommended: 0.2-0.4 (20-40% of max input)

### Model Selection

- **Recommended**: Use a lightweight, cost-effective model for summaries
  - Examples: `gpt-4o-mini`, `claude-haiku`, or equivalent
  - Summaries don't require the most powerful models
  - Significant cost savings on high-volume applications
- **Default**: If `model_name` is `null`, uses the default model
  - May be more expensive but ensures consistency
  - Good for simple setups

### Optimization Tips

1. **Balance triggers**: Combine token and message triggers for robust handling

   ```yaml
   trigger:
     - type: tokens
       value: 4000
     - type: messages
       value: 50
   ```

2. **Conservative retention**: Keep more messages initially, then adjust based on performance

   ```yaml
   keep:
     type: messages
     value: 25  # Start higher, reduce if needed
   ```

3. **Trim strategically**: Limit the tokens sent to the summarization model

   ```yaml
   trim_tokens_to_summarize: 4000  # Prevents expensive summarization calls
   ```

4. **Monitor and iterate**: Track summary quality and adjust the configuration

## Troubleshooting

### Summary Quality Issues

**Problem**: Summaries losing important context

**Solutions**:

1. Increase the `keep` value to preserve more messages
2. Decrease trigger thresholds to summarize earlier
3. Customize `summary_prompt` to emphasize key information
4. Use a more capable model for summarization

### Performance Issues

**Problem**: Summarization calls taking too long

**Solutions**:

1. Use a faster model for summaries (e.g., `gpt-4o-mini`)
2. Reduce `trim_tokens_to_summarize` to send less context
3. Increase trigger thresholds to summarize less frequently

### Token Limit Errors

**Problem**: Still hitting token limits despite summarization

**Solutions**:

1. Lower trigger thresholds to summarize earlier
2. Reduce the `keep` value to preserve fewer messages
3. Check whether individual messages are very large
4. Consider using fraction-based triggers

## Implementation Details

### Code Structure

- **Configuration**: `packages/harness/deerflow/config/summarization_config.py`
- **Integration**: `packages/harness/deerflow/agents/lead_agent/agent.py`
- **Middleware**: Uses `langchain.agents.middleware.SummarizationMiddleware`

### Middleware Order

Summarization runs after ThreadData and Sandbox initialization but before Title and Clarification:

1. ThreadDataMiddleware
2. SandboxMiddleware
3. **SummarizationMiddleware** ← Runs here
4. TitleMiddleware
5. ClarificationMiddleware

### State Management

- Summarization is stateless - configuration is loaded once at startup
- Summaries are added as regular messages in the conversation history
- The checkpointer persists the summarized history automatically

## Example Configurations

### Minimal Configuration

```yaml
summarization:
  enabled: true
  trigger:
    type: tokens
    value: 4000
  keep:
    type: messages
    value: 20
```

### Production Configuration

```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini  # Lightweight model for cost efficiency
  trigger:
    - type: tokens
      value: 6000
    - type: messages
      value: 75
  keep:
    type: messages
    value: 25
  trim_tokens_to_summarize: 5000
```

### Multi-Model Configuration

```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini
  trigger:
    type: fraction
    value: 0.7  # 70% of the model's max input
  keep:
    type: fraction
    value: 0.3  # Keep 30% of max input
  trim_tokens_to_summarize: 4000
```

### Conservative Configuration (High Quality)

```yaml
summarization:
  enabled: true
  model_name: gpt-4  # Use the full model for high-quality summaries
  trigger:
    type: tokens
    value: 8000
  keep:
    type: messages
    value: 40  # Keep more context
  trim_tokens_to_summarize: null  # No trimming
```

## References

- [LangChain Summarization Middleware Documentation](https://docs.langchain.com/oss/python/langchain/middleware/built-in#summarization)
- [LangChain Source Code](https://github.com/langchain-ai/langchain)