# Conversation Summarization

DeerFlow includes automatic conversation summarization to handle long conversations that approach model token limits. When enabled, the system automatically condenses older messages while preserving recent context.

## Overview

The summarization feature uses LangChain's `SummarizationMiddleware` to monitor conversation history and trigger summarization based on configurable thresholds. When activated, it:

1. Monitors message token counts in real time
2. Triggers summarization when thresholds are met
3. Keeps recent messages intact while summarizing older exchanges
4. Maintains AI/Tool message pairs together for context continuity
5. Injects the summary back into the conversation

## Configuration

Summarization is configured in `config.yaml` under the `summarization` key:

```yaml
summarization:
  enabled: true
  model_name: null # Use default model or specify a lightweight model

  # Trigger conditions (OR logic - any condition triggers summarization)
  trigger:
    - type: tokens
      value: 4000
    # Additional triggers (optional)
    # - type: messages
    #   value: 50
    # - type: fraction
    #   value: 0.8 # 80% of model's max input tokens

  # Context retention policy
  keep:
    type: messages
    value: 20

  # Token trimming for the summarization call
  trim_tokens_to_summarize: 4000

  # Custom summary prompt (optional)
  summary_prompt: null
```

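For illustration, the block above could be parsed into a small config object like the following sketch. The `SummarizationConfig` dataclass and `load_summarization_config` helper here are hypothetical stand-ins that mirror the YAML keys; the real schema lives in `packages/harness/deerflow/config/summarization_config.py`.

```python
# Hypothetical sketch: load the `summarization` block from config.yaml.
# Field names mirror the YAML keys above; the real schema is defined in
# packages/harness/deerflow/config/summarization_config.py.
from dataclasses import dataclass, field

import yaml


@dataclass
class SummarizationConfig:
    enabled: bool = False
    model_name: str | None = None
    trigger: list[dict] = field(default_factory=list)
    keep: dict = field(default_factory=lambda: {"type": "messages", "value": 20})
    trim_tokens_to_summarize: int | None = 4000
    summary_prompt: str | None = None


def load_summarization_config(path: str = "config.yaml") -> SummarizationConfig:
    with open(path) as f:
        raw = yaml.safe_load(f) or {}  # guard against empty config files
    section = raw.get("summarization") or {}
    # Normalize a single trigger mapping to a one-element list.
    trigger = section.get("trigger", [])
    if isinstance(trigger, dict):
        trigger = [trigger]
    section["trigger"] = trigger
    return SummarizationConfig(**section)
```
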
### Configuration Options

#### `enabled`

- **Type**: Boolean
- **Default**: `false`
- **Description**: Enable or disable automatic summarization

#### `model_name`

- **Type**: String or null
- **Default**: `null` (uses the default model)
- **Description**: Model to use for generating summaries. A lightweight, cost-effective model such as `gpt-4o-mini` or equivalent is recommended.

#### `trigger`

- **Type**: Single `ContextSize` object or list of `ContextSize` objects
- **Required**: At least one trigger must be specified when summarization is enabled
- **Description**: Thresholds that trigger summarization. Uses OR logic: summarization runs when any threshold is met (see the sketch at the end of this subsection).

**ContextSize Types:**

1. **Token-based trigger**: Activates when the token count reaches the specified value

   ```yaml
   trigger:
     type: tokens
     value: 4000
   ```

2. **Message-based trigger**: Activates when the message count reaches the specified value

   ```yaml
   trigger:
     type: messages
     value: 50
   ```

3. **Fraction-based trigger**: Activates when token usage reaches a percentage of the model's maximum input tokens

   ```yaml
   trigger:
     type: fraction
     value: 0.8 # 80% of max input tokens
   ```

**Multiple Triggers:**

```yaml
trigger:
  - type: tokens
    value: 4000
  - type: messages
    value: 50
```

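To make the OR semantics concrete, here is a minimal illustrative sketch of the trigger check (the function name and dict shape are hypothetical, not DeerFlow's actual implementation):

```python
# Illustrative only: evaluate trigger thresholds with OR logic, as the
# `trigger` option describes. Names mirror the YAML above.
def should_summarize(
    triggers: list[dict],
    token_count: int,
    message_count: int,
    max_input_tokens: int,
) -> bool:
    """Return True if ANY configured threshold is met."""
    for trigger in triggers:
        kind, value = trigger["type"], trigger["value"]
        if kind == "tokens" and token_count >= value:
            return True
        if kind == "messages" and message_count >= value:
            return True
        if kind == "fraction" and token_count >= value * max_input_tokens:
            return True
    return False


# With the two triggers above, 4200 tokens alone is enough to fire.
triggers = [{"type": "tokens", "value": 4000}, {"type": "messages", "value": 50}]
assert should_summarize(triggers, token_count=4200, message_count=12, max_input_tokens=8192)
```
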
#### `keep`

- **Type**: `ContextSize` object
- **Default**: `{type: messages, value: 20}`
- **Description**: Specifies how much recent conversation history to preserve after summarization.

**Examples:**

```yaml
# Keep the most recent 20 messages
keep:
  type: messages
  value: 20

# Keep the most recent 3000 tokens
keep:
  type: tokens
  value: 3000

# Keep the most recent 30% of the model's max input tokens
keep:
  type: fraction
  value: 0.3
```

#### `trim_tokens_to_summarize`

- **Type**: Integer or null
- **Default**: `4000`
- **Description**: Maximum tokens to include when preparing messages for the summarization call itself. Set to `null` to skip trimming (not recommended for very long conversations).

#### `summary_prompt`

- **Type**: String or null
- **Default**: `null` (uses LangChain's default prompt)
- **Description**: Custom prompt template for generating summaries. The prompt should guide the model to extract the most important context.

**Default Prompt Behavior:**

The default LangChain prompt instructs the model to:

- Extract the highest-quality, most relevant context
- Focus on information critical to the overall goal
- Avoid repeating completed actions
- Return only the extracted context

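If you do override it, keep the same goals in mind. For example, a custom prompt might look like this (an illustrative prompt, not a shipped default):

```yaml
summary_prompt: |
  Condense the conversation below into a brief summary. Preserve the user's
  goals, decisions made, and any facts needed to continue the task. Do not
  repeat completed actions. Return only the summary text.
```
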
## How It Works

### Summarization Flow

1. **Monitoring**: Before each model call, the middleware counts the tokens in the message history
2. **Trigger Check**: If any configured threshold is met, summarization is triggered
3. **Message Partitioning**: Messages are split into:
   - Messages to summarize (older messages beyond the `keep` threshold)
   - Messages to preserve (recent messages within the `keep` threshold)
4. **Summary Generation**: The model generates a concise summary of the older messages
5. **Context Replacement**: The message history is updated:
   - All old messages are removed
   - A single summary message is added
   - Recent messages are preserved
6. **AI/Tool Pair Protection**: The system ensures that AI messages and their corresponding tool messages stay together

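The flow can be condensed into a short conceptual sketch (hypothetical names, not the middleware's actual code):

```python
# Conceptual outline of the flow above; `summarize_fn` wraps the model call.
def summarize_history(messages: list, keep_count: int, summarize_fn) -> list:
    """Replace older messages with a single summary, keeping recent ones."""
    cutoff = max(len(messages) - keep_count, 0)
    # In practice the cutoff is also adjusted so AI/Tool pairs are never
    # split; see the sketch under "Message Preservation" below.
    to_summarize, to_keep = messages[:cutoff], messages[cutoff:]
    if not to_summarize:
        return messages  # nothing old enough to summarize yet
    summary_text = summarize_fn(to_summarize)  # the summarization model call
    summary = f"Here is a summary of the conversation to date:\n\n{summary_text}"
    return [("human", summary), *to_keep]
```
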
### Token Counting

- Uses approximate token counting based on character count
- For Anthropic models: ~3.3 characters per token
- For other models: uses LangChain's default estimation
- Can be customized with a custom `token_counter` function (see the sketch below)

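For example, a character-based counter matching the approximation above could look like this (illustrative; the exact counter interface expected by the middleware depends on your installed LangChain version):

```python
# Illustrative ~3.3 characters-per-token estimate, as described above for
# Anthropic models. Message `content` is coerced to str to keep the sketch
# simple; real content may be a list of content blocks.
def approximate_token_counter(messages) -> int:
    total_chars = sum(len(str(getattr(m, "content", "") or "")) for m in messages)
    return int(total_chars / 3.3)
```
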
### Message Preservation

The middleware intelligently preserves message context:

- **Recent Messages**: Always kept intact based on the `keep` configuration
- **AI/Tool Pairs**: Never split. If a cutoff point falls within tool messages, the system adjusts it to keep the entire AI + Tool message sequence together (sketched below)
- **Summary Format**: The summary is injected as a HumanMessage with the format:

  ```
  Here is a summary of the conversation to date:

  [Generated summary text]
  ```

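The pair-protection rule can be sketched as follows (hypothetical helper, not the actual middleware code): starting from a proposed cutoff, step backwards while the boundary message is a tool result, so the AI message that issued the tool calls stays with its results.

```python
# Hypothetical sketch of AI/Tool pair protection. `is_tool_message` stands
# in for an isinstance check against your ToolMessage type.
def adjust_for_tool_pairs(messages: list, cutoff: int, is_tool_message) -> int:
    while cutoff > 0 and is_tool_message(messages[cutoff]):
        cutoff -= 1  # step back past tool results to the issuing AI message
    return cutoff


# Example: AI with tool calls at index 1, results at indices 2-3. A cutoff
# of 3 would orphan a tool result, so it is pulled back to index 1.
msgs = ["human", "ai+tool_calls", "tool", "tool", "ai"]
assert adjust_for_tool_pairs(msgs, 3, lambda m: m == "tool") == 1
```
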
## Best Practices

### Choosing Trigger Thresholds

1. **Token-based triggers**: Recommended for most use cases
   - Set to 60-80% of your model's context window
   - Example: for an 8K context, use 4000-6000 tokens

2. **Message-based triggers**: Useful for controlling conversation length
   - Good for applications with many short messages
   - Example: 50-100 messages, depending on average message length

3. **Fraction-based triggers**: Ideal when using multiple models
   - Automatically adapts to each model's capacity
   - Example: 0.8 (80% of the model's max input tokens)

### Choosing Retention Policy (`keep`)

1. **Message-based retention**: Best for most scenarios
   - Preserves natural conversation flow
   - Recommended: 15-25 messages

2. **Token-based retention**: Use when precise control is needed
   - Good for managing exact token budgets
   - Recommended: 2000-4000 tokens

3. **Fraction-based retention**: For multi-model setups
   - Automatically scales with model capacity
   - Recommended: 0.2-0.4 (20-40% of max input)

### Model Selection

- **Recommended**: Use a lightweight, cost-effective model for summaries
  - Examples: `gpt-4o-mini`, `claude-haiku`, or equivalent
  - Summaries don't require the most powerful models
  - Significant cost savings for high-volume applications

- **Default**: If `model_name` is `null`, the default model is used
  - May be more expensive, but ensures consistency
  - Good for simple setups

### Optimization Tips

1. **Balance triggers**: Combine token and message triggers for robust handling

   ```yaml
   trigger:
     - type: tokens
       value: 4000
     - type: messages
       value: 50
   ```

2. **Conservative retention**: Keep more messages initially, then adjust based on performance

   ```yaml
   keep:
     type: messages
     value: 25 # Start higher, reduce if needed
   ```

3. **Trim strategically**: Limit the tokens sent to the summarization model

   ```yaml
   trim_tokens_to_summarize: 4000 # Prevents expensive summarization calls
   ```

4. **Monitor and iterate**: Track summary quality and adjust the configuration accordingly

## Troubleshooting

### Summary Quality Issues

**Problem**: Summaries lose important context

**Solutions**:

1. Increase the `keep` value to preserve more messages
2. Decrease trigger thresholds to summarize earlier
3. Customize `summary_prompt` to emphasize key information
4. Use a more capable model for summarization

### Performance Issues

**Problem**: Summarization calls take too long

**Solutions**:

1. Use a faster model for summaries (e.g., `gpt-4o-mini`)
2. Reduce `trim_tokens_to_summarize` to send less context
3. Increase trigger thresholds to summarize less frequently

### Token Limit Errors

**Problem**: Still hitting token limits despite summarization

**Solutions**:

1. Lower trigger thresholds to summarize earlier
2. Reduce the `keep` value to preserve fewer messages
3. Check whether individual messages are very large
4. Consider using fraction-based triggers

## Implementation Details

### Code Structure

- **Configuration**: `packages/harness/deerflow/config/summarization_config.py`
- **Integration**: `packages/harness/deerflow/agents/lead_agent/agent.py`
- **Middleware**: Uses `langchain.agents.middleware.SummarizationMiddleware`

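As a rough wiring sketch, the middleware might be constructed from the config and handed to the agent like this. The parameter names below are assumptions that mirror this document's config keys; verify them against `SummarizationMiddleware`'s signature in your installed LangChain version.

```python
# Assumed wiring; parameter names mirror this document's config keys and
# must be checked against your LangChain version before relying on this.
from langchain.agents import create_agent
from langchain.agents.middleware import SummarizationMiddleware

summarization = SummarizationMiddleware(
    model="openai:gpt-4o-mini",     # summarization.model_name
    trigger=[("tokens", 4000)],     # summarization.trigger (OR logic)
    keep=("messages", 20),          # summarization.keep
    trim_tokens_to_summarize=4000,  # summarization.trim_tokens_to_summarize
)

agent = create_agent(
    model="openai:gpt-4o",
    tools=[],
    middleware=[summarization],  # in DeerFlow this sits between Sandbox and Title
)
```
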
### Middleware Order

Summarization runs after ThreadData and Sandbox initialization but before Title and Clarification:

1. ThreadDataMiddleware
2. SandboxMiddleware
3. **SummarizationMiddleware** ← Runs here
4. TitleMiddleware
5. ClarificationMiddleware

### State Management

- Summarization is stateless: the configuration is loaded once at startup
- Summaries are added as regular messages in the conversation history
- The checkpointer persists the summarized history automatically

## Example Configurations

### Minimal Configuration

```yaml
summarization:
  enabled: true
  trigger:
    type: tokens
    value: 4000
  keep:
    type: messages
    value: 20
```

### Production Configuration

```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini # Lightweight model for cost efficiency
  trigger:
    - type: tokens
      value: 6000
    - type: messages
      value: 75
  keep:
    type: messages
    value: 25
  trim_tokens_to_summarize: 5000
```

### Multi-Model Configuration

```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini
  trigger:
    type: fraction
    value: 0.7 # 70% of model's max input
  keep:
    type: fraction
    value: 0.3 # Keep 30% of max input
  trim_tokens_to_summarize: 4000
```

### Conservative Configuration (High Quality)

```yaml
summarization:
  enabled: true
  model_name: gpt-4 # Use a full-size model for high-quality summaries
  trigger:
    type: tokens
    value: 8000
  keep:
    type: messages
    value: 40 # Keep more context
  trim_tokens_to_summarize: null # No trimming
```

## References

- [LangChain Summarization Middleware Documentation](https://docs.langchain.com/oss/python/langchain/middleware/built-in#summarization)
- [LangChain Source Code](https://github.com/langchain-ai/langchain)