
# Conversation Summarization

DeerFlow includes automatic conversation summarization to handle long conversations that approach model token limits. When enabled, it condenses older messages while preserving recent context.

## Overview

The summarization feature uses LangChain's `SummarizationMiddleware` to monitor conversation history and trigger summarization based on configurable thresholds. When activated, it:

1. Monitors the message history (token and message counts) before each model call
2. Triggers summarization when thresholds are met
3. Keeps recent messages intact while summarizing older exchanges
4. Maintains AI/Tool message pairs together for context continuity
5. Injects the summary back into the conversation

## Configuration

Summarization is configured in `config.yaml` under the `summarization` key:

```yaml
summarization:
  enabled: true
  model_name: null  # Use default model or specify a lightweight model

  # Trigger conditions (OR logic - any condition triggers summarization)
  trigger:
    - type: tokens
      value: 4000
    # Additional triggers (optional)
    # - type: messages
    #   value: 50
    # - type: fraction
    #   value: 0.8  # 80% of model's max input tokens

  # Context retention policy
  keep:
    type: messages
    value: 20

  # Token trimming for summarization call
  trim_tokens_to_summarize: 4000

  # Custom summary prompt (optional)
  summary_prompt: null
```

### Configuration Options

#### `enabled`
- **Type**: Boolean
- **Default**: `false`
- **Description**: Enable or disable automatic summarization

#### `model_name`
- **Type**: String or null
- **Default**: `null` (uses default model)
- **Description**: Model to use for generating summaries. A lightweight, cost-effective model such as `gpt-4o-mini` (or an equivalent) is recommended.

#### `trigger`
- **Type**: Single `ContextSize` or list of `ContextSize` objects
- **Required**: At least one trigger must be specified when enabled
- **Description**: Thresholds that trigger summarization. Uses OR logic - summarization runs when ANY threshold is met.

**ContextSize Types:**

1. **Token-based trigger**: Activates when token count reaches the specified value
   ```yaml
   trigger:
     type: tokens
     value: 4000
   ```

2. **Message-based trigger**: Activates when message count reaches the specified value
   ```yaml
   trigger:
     type: messages
     value: 50
   ```

3. **Fraction-based trigger**: Activates when token usage reaches a percentage of the model's maximum input tokens
   ```yaml
   trigger:
     type: fraction
     value: 0.8  # 80% of max input tokens
   ```

**Multiple Triggers:**
```yaml
trigger:
  - type: tokens
    value: 4000
  - type: messages
    value: 50
```
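
The OR logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the middleware's actual implementation: the `should_summarize` function, its parameters, and the `(type, value)` tuple representation are assumptions made for this example.

```python
from typing import Literal, Optional, Sequence, Tuple

# (type, value) pairs mirroring the `type` / `value` keys in config.yaml.
# This tuple shape is an assumption made for the sketch.
Trigger = Tuple[Literal["tokens", "messages", "fraction"], float]


def should_summarize(
    triggers: Sequence[Trigger],
    token_count: int,
    message_count: int,
    max_input_tokens: Optional[int] = None,
) -> bool:
    """Return True as soon as any single trigger threshold is met (OR logic)."""
    for kind, value in triggers:
        if kind == "tokens" and token_count >= value:
            return True
        if kind == "messages" and message_count >= value:
            return True
        if kind == "fraction" and max_input_tokens is not None:
            # A fraction trigger can only be evaluated when the model's
            # input limit is known.
            if token_count >= value * max_input_tokens:
                return True
    return False


triggers = [("tokens", 4000), ("messages", 50)]
print(should_summarize(triggers, token_count=1200, message_count=50))  # True: message trigger met
print(should_summarize(triggers, token_count=1200, message_count=10))  # False: neither trigger met
```

Because the checks are independent, adding a second trigger can only make summarization fire sooner, never later.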

#### `keep`
- **Type**: `ContextSize` object
- **Default**: `{type: messages, value: 20}`
- **Description**: Specifies how much recent conversation history to preserve after summarization.

**Examples** (alternatives; set only one `keep` block in your configuration):
```yaml
# Keep most recent 20 messages
keep:
  type: messages
  value: 20

# Keep most recent 3000 tokens
keep:
  type: tokens
  value: 3000

# Keep most recent 30% of model's max input tokens
keep:
  type: fraction
  value: 0.3
```

#### `trim_tokens_to_summarize`
- **Type**: Integer or null
- **Default**: `4000`
- **Description**: Maximum tokens to include when preparing messages for the summarization call itself. Set to `null` to skip trimming (not recommended for very long conversations).
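
To make the effect of this setting concrete, here is a minimal sketch of trimming a message list to a token budget before it is sent to the summarization model. It assumes the most recent messages are the ones retained; `trim_for_summary` and the word-count token counter are hypothetical helpers for illustration, and the real middleware may trim differently.

```python
from typing import Callable, List, Optional


def trim_for_summary(
    messages: List[str],
    max_tokens: Optional[int],
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Keep the newest messages whose combined token count fits max_tokens."""
    if max_tokens is None:
        return list(messages)  # `null` in the config: no trimming
    kept: List[str] = []
    used = 0
    # Walk backwards from the newest message and stop at the first one
    # that would exceed the budget.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def count_words(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


history = ["a b c", "d e", "f g h i"]
print(trim_for_summary(history, max_tokens=6, count_tokens=count_words))
```

With a budget of 6 "tokens", the oldest message is dropped and the two most recent ones are kept.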

#### `summary_prompt`
- **Type**: String or null
- **Default**: `null` (uses LangChain's default prompt)
- **Description**: Custom prompt template for generating summaries. The prompt should guide the model to extract the most important context.

**Default Prompt Behavior:**
The default LangChain prompt instructs the model to:
- Extract highest quality/most relevant context
- Focus on information critical to the overall goal
- Avoid repeating completed actions
- Return only the extracted context

## How It Works

### Summarization Flow

1. **Monitoring**: Before each model call, the middleware counts tokens in the message history
2. **Trigger Check**: If any configured threshold is met, summarization is triggered
3. **Message Partitioning**: Messages are split into:
   - Messages to summarize (older messages beyond the `keep` threshold)
   - Messages to preserve (recent messages within the `keep` threshold)
4. **Summary Generation**: The model generates a concise summary of the older messages
5. **Context Replacement**: The message history is updated:
   - All old messages are removed
   - A single summary message is added
   - Recent messages are preserved
6. **AI/Tool Pair Protection**: The system ensures AI messages and their corresponding tool messages stay together
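
The partition, summarize, and replace steps above can be sketched as follows. This is a simplified illustration under stated assumptions: messages are plain dictionaries rather than LangChain message objects, `keep` is message-based, the `summarize` callable stands in for the model call, and AI/Tool pair protection is left out for brevity.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}, a simplified stand-in


def summarize_history(
    messages: List[Message],
    keep_last: int,
    summarize: Callable[[List[Message]], str],
) -> List[Message]:
    """Replace everything except the last `keep_last` messages with one summary."""
    if len(messages) <= keep_last:
        return list(messages)  # nothing old enough to summarize
    cutoff = len(messages) - keep_last
    older, recent = messages[:cutoff], messages[cutoff:]
    summary = {
        "role": "human",
        "content": "Here is a summary of the conversation to date:\n\n" + summarize(older),
    }
    # Old messages are dropped, one summary message takes their place,
    # and the recent messages follow unchanged.
    return [summary] + recent


history = [{"role": "human", "content": f"message {i}"} for i in range(5)]
result = summarize_history(
    history,
    keep_last=2,
    summarize=lambda old: f"{len(old)} earlier messages",  # stub for the model call
)
print(len(result))  # 3: one summary plus the two most recent messages
```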

### Token Counting

- Uses approximate token counting based on character count
- For Anthropic models: ~3.3 characters per token
- For other models: Uses LangChain's default estimation
- Can be customized with a custom `token_counter` function
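
The character-based estimate can be sketched as below. The 3.3 ratio is the figure quoted above for Anthropic models; the 4.0 default is an assumption based on a common rule of thumb, and `approx_tokens` is a hypothetical helper rather than the function LangChain uses.

```python
import math

CHARS_PER_TOKEN_ANTHROPIC = 3.3  # ratio quoted in this document
CHARS_PER_TOKEN_DEFAULT = 4.0    # assumed rule of thumb, not a documented value


def approx_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN_DEFAULT) -> int:
    """Estimate tokens by dividing the character count by a fixed ratio."""
    return math.ceil(len(text) / chars_per_token)


sample = "x" * 100
print(approx_tokens(sample))                             # 25
print(approx_tokens(sample, CHARS_PER_TOKEN_ANTHROPIC))  # 31
```

A lower characters-per-token ratio yields a higher estimate, so the same text reaches a token-based trigger sooner.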

### Message Preservation

The middleware preserves message context as follows:

- **Recent Messages**: Always kept intact based on `keep` configuration
- **AI/Tool Pairs**: Never split - if a cutoff point falls within tool messages, the system adjusts to keep the entire AI + Tool message sequence together
- **Summary Format**: The summary is injected as a `HumanMessage` with the format:
  ```
  Here is a summary of the conversation to date:

  [Generated summary text]
  ```
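
One way to keep AI/Tool sequences intact is to move the cutoff index so it never lands on a tool message. The sketch below moves the cutoff earlier, which preserves the whole sequence; this direction is an assumption for illustration, and the real middleware may adjust the cutoff differently. `safe_cutoff` and the role strings are hypothetical.

```python
from typing import List


def safe_cutoff(roles: List[str], cutoff: int) -> int:
    """Move `cutoff` earlier until it no longer splits an AI/tool sequence.

    `roles[i]` is "human", "ai", or "tool". Messages before the returned
    index are summarized; messages from it onward are preserved.
    """
    # A tool message at the cutoff means the AI message that requested it
    # would be summarized away, so step back past the tool results until
    # the cutoff sits on the AI message that starts the sequence.
    while 0 < cutoff < len(roles) and roles[cutoff] == "tool":
        cutoff -= 1
    return cutoff


roles = ["human", "ai", "tool", "tool", "human", "ai"]
print(safe_cutoff(roles, 3))  # 1: the "ai" message and both tool results are preserved
print(safe_cutoff(roles, 4))  # 4: the cutoff already sits on a "human" message
```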

## Best Practices

### Choosing Trigger Thresholds

1. **Token-based triggers**: Recommended for most use cases
   - Set to 60-80% of your model's context window
   - Example: For an 8K context window, that is roughly 4800-6400 tokens

2. **Message-based triggers**: Useful for controlling conversation length
   - Good for applications with many short messages
   - Example: 50-100 messages depending on average message length

3. **Fraction-based triggers**: Ideal when using multiple models
   - Automatically adapts to each model's capacity
   - Example: 0.8 (80% of model's max input tokens)
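
As a small worked example of how a fraction-based trigger adapts to different models, the sketch below converts one fraction into an absolute token threshold per model. The model names and input limits are hypothetical values chosen for illustration, and `fraction_threshold` is not part of DeerFlow or LangChain.

```python
def fraction_threshold(max_input_tokens: int, fraction: float) -> int:
    """Token count at which a fraction-based trigger would fire."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(max_input_tokens * fraction)


# Hypothetical input limits, for illustration only.
limits = {"small-model": 8_000, "large-model": 128_000}
for name, limit in limits.items():
    print(name, fraction_threshold(limit, 0.8))
```

The same `0.8` setting yields a different absolute threshold for each model, which is why no per-model tuning is needed.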

### Choosing Retention Policy (`keep`)

1. **Message-based retention**: Best for most scenarios
   - Preserves natural conversation flow
   - Recommended: 15-25 messages

2. **Token-based retention**: Use when precise control is needed
   - Good for managing exact token budgets
   - Recommended: 2000-4000 tokens

3. **Fraction-based retention**: For multi-model setups
   - Automatically scales with model capacity
   - Recommended: 0.2-0.4 (20-40% of max input)

### Model Selection

- **Recommended**: Use a lightweight, cost-effective model for summaries
  - Examples: `gpt-4o-mini`, `claude-haiku`, or equivalent
  - Summaries don't require the most powerful models
  - Significant cost savings on high-volume applications

- **Default**: If `model_name` is `null`, uses the default model
  - May be more expensive but ensures consistency
  - Good for simple setups

### Optimization Tips

1. **Balance triggers**: Combine token and message triggers for robust handling
   ```yaml
   trigger:
     - type: tokens
       value: 4000
     - type: messages
       value: 50
   ```

2. **Conservative retention**: Keep more messages initially, adjust based on performance
   ```yaml
   keep:
     type: messages
     value: 25  # Start higher, reduce if needed
   ```

3. **Trim strategically**: Limit tokens sent to summarization model
   ```yaml
   trim_tokens_to_summarize: 4000  # Prevents expensive summarization calls
   ```

4. **Monitor and iterate**: Track summary quality and adjust configuration

## Troubleshooting

### Summary Quality Issues

**Problem**: Summaries lose important context

**Solutions**:
1. Increase `keep` value to preserve more messages
2. Decrease trigger thresholds to summarize earlier
3. Customize `summary_prompt` to emphasize key information
4. Use a more capable model for summarization

### Performance Issues

**Problem**: Summarization calls take too long

**Solutions**:
1. Use a faster model for summaries (e.g., `gpt-4o-mini`)
2. Reduce `trim_tokens_to_summarize` to send less context
3. Increase trigger thresholds to summarize less frequently

### Token Limit Errors

**Problem**: Conversations still hit token limits despite summarization

**Solutions**:
1. Lower trigger thresholds to summarize earlier
2. Reduce `keep` value to preserve fewer messages
3. Check if individual messages are very large
4. Consider using fraction-based triggers
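
To check whether individual messages are very large, a quick diagnostic like the following can help. It is a hypothetical helper that works on plain strings and ranks messages by character count, which is only a rough proxy for token count.

```python
from typing import List, Tuple


def largest_messages(contents: List[str], top_n: int = 3) -> List[Tuple[int, int]]:
    """Return (index, character_count) for the `top_n` longest messages."""
    sizes = [(i, len(text)) for i, text in enumerate(contents)]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:top_n]


contents = ["hi", "x" * 10_000, "a short reply"]
print(largest_messages(contents, top_n=1))  # [(1, 10000)]
```

A single oversized message inside the `keep` window cannot be summarized away, so lowering thresholds alone will not help in that case.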

## Implementation Details

### Code Structure

- **Configuration**: `src/config/summarization_config.py`
- **Integration**: `src/agents/lead_agent/agent.py`
- **Middleware**: Uses `langchain.agents.middleware.SummarizationMiddleware`
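
For orientation, here is a rough sketch of how the configuration options documented above could be modeled and turned into middleware arguments. It is not the contents of `summarization_config.py`: the class names, the `middleware_kwargs` method, and the `(type, value)` tuple form are all assumptions for illustration. Only the field names and defaults come from this document.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ContextSize:
    type: str  # "tokens", "messages", or "fraction"
    value: float

    def to_tuple(self) -> Tuple[str, float]:
        return (self.type, self.value)


@dataclass
class SummarizationSettings:
    # Defaults mirror the Configuration Options section.
    enabled: bool = False
    model_name: Optional[str] = None
    trigger: List[ContextSize] = field(default_factory=list)
    keep: ContextSize = field(default_factory=lambda: ContextSize("messages", 20))
    trim_tokens_to_summarize: Optional[int] = 4000
    summary_prompt: Optional[str] = None

    def middleware_kwargs(self) -> Dict[str, Any]:
        """Collect keyword arguments for a summarization middleware (hypothetical)."""
        if self.enabled and not self.trigger:
            raise ValueError("at least one trigger is required when enabled")
        kwargs: Dict[str, Any] = {
            "trigger": [t.to_tuple() for t in self.trigger],
            "keep": self.keep.to_tuple(),
            "trim_tokens_to_summarize": self.trim_tokens_to_summarize,
        }
        if self.summary_prompt is not None:
            # Omit the key so the middleware falls back to its default prompt.
            kwargs["summary_prompt"] = self.summary_prompt
        return kwargs


settings = SummarizationSettings(enabled=True, trigger=[ContextSize("tokens", 4000)])
print(settings.middleware_kwargs())
```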

### Middleware Order

Summarization runs after ThreadData and Sandbox initialization but before Title and Clarification:

1. ThreadDataMiddleware
2. SandboxMiddleware
3. **SummarizationMiddleware** ← Runs here
4. TitleMiddleware
5. ClarificationMiddleware

### State Management

- Summarization is stateless - configuration is loaded once at startup
- Summaries are added as regular messages in the conversation history
- The checkpointer persists the summarized history automatically

## Example Configurations

### Minimal Configuration
```yaml
summarization:
  enabled: true
  trigger:
    type: tokens
    value: 4000
  keep:
    type: messages
    value: 20
```

### Production Configuration
```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini  # Lightweight model for cost efficiency
  trigger:
    - type: tokens
      value: 6000
    - type: messages
      value: 75
  keep:
    type: messages
    value: 25
  trim_tokens_to_summarize: 5000
```

### Multi-Model Configuration
```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini
  trigger:
    type: fraction
    value: 0.7  # 70% of model's max input
  keep:
    type: fraction
    value: 0.3  # Keep 30% of max input
  trim_tokens_to_summarize: 4000
```

### Conservative Configuration (High Quality)
```yaml
summarization:
  enabled: true
  model_name: gpt-4  # Use full model for high-quality summaries
  trigger:
    type: tokens
    value: 8000
  keep:
    type: messages
    value: 40  # Keep more context
  trim_tokens_to_summarize: null  # No trimming
```

## References

- [LangChain Summarization Middleware Documentation](https://docs.langchain.com/oss/python/langchain/middleware/built-in#summarization)
- [LangChain Source Code](https://github.com/langchain-ai/langchain)