fix(memory): inject stored facts into system prompt memory context (#1083)

* fix(memory): inject stored facts into system prompt memory context - add Facts section rendering in format_memory_for_injection - rank facts by confidence and coerce confidence values safely - enforce max token budget while appending fact lines - add regression tests for fact inclusion, ordering, and budget behavior Fixes #1059 * Update the document with the latest status * fix(memory): harden fact injection — NaN/inf confidence, None content, incremental token budget (#1090) * Initial plan * fix(memory): address review feedback on confidence coercion, None content, and token budget Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com> --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
2026-04-20 12:54:45 +08:00 · 2026-03-13 14:37:40 +08:00
parent 3521cc2668
commit b5fcb1334a
4 changed files with 255 additions and 505 deletions
--- a/backend/docs/MEMORY_IMPROVEMENTS.md
+++ b/backend/docs/MEMORY_IMPROVEMENTS.md
@@ -1,281 +1,65 @@
 # Memory System Improvements

-This document describes recent improvements to the memory system's fact injection mechanism.
+This document tracks memory injection behavior and roadmap status.

-## Overview
+## Status (As Of 2026-03-10)

-Two major improvements have been made to the `format_memory_for_injection` function:
+Implemented in `main`:
+- Accurate token counting via `tiktoken` in `format_memory_for_injection`.
+- Facts are injected into prompt memory context.
+- Facts are ranked by confidence (descending).
+- Injection respects `max_injection_tokens` budget.

-1. **Similarity-Based Fact Retrieval**: Uses TF-IDF to select facts most relevant to current conversation context
-2. **Accurate Token Counting**: Uses tiktoken for precise token estimation instead of rough character-based approximation
+Planned / not yet merged:
+- TF-IDF similarity-based fact retrieval.
+- `current_context` input for context-aware scoring.
+- Configurable similarity/confidence weights (`similarity_weight`, `confidence_weight`).
+- Middleware/runtime wiring for context-aware retrieval before each model call.

-## 1. Similarity-Based Fact Retrieval
+## Current Behavior

-### Problem
-The original implementation selected facts based solely on confidence scores, taking the top 15 highest-confidence facts regardless of their relevance to the current conversation. This could result in injecting irrelevant facts while omitting contextually important ones.
-
-### Solution
-The new implementation uses **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorization with cosine similarity to measure how relevant each fact is to the current conversation context.
-
-**Scoring Formula**:
-```
-final_score = (similarity × 0.6) + (confidence × 0.4)
-```
-
- **Similarity (60% weight)**: Cosine similarity between fact content and current context
- **Confidence (40% weight)**: LLM-assigned confidence score (0-1)
-
-### Benefits
- **Context-Aware**: Prioritizes facts relevant to what the user is currently discussing
- **Dynamic**: Different facts surface based on conversation topic
- **Balanced**: Considers both relevance and reliability
- **Fallback**: Gracefully degrades to confidence-only ranking if context is unavailable
-
-### Example
-Given facts about Python, React, and Docker:
- User asks: *"How should I write Python tests?"*
-  - Prioritizes: Python testing, type hints, pytest
- User asks: *"How to optimize my Next.js app?"*
-  - Prioritizes: React/Next.js experience, performance optimization
-
-### Configuration
-Customize weights in `config.yaml` (optional):
-```yaml
-memory:
-  similarity_weight: 0.6  # Weight for TF-IDF similarity (0-1)
-  confidence_weight: 0.4  # Weight for confidence score (0-1)
-```
-
-**Note**: Weights should sum to 1.0 for best results.
-
-## 2. Accurate Token Counting
-
-### Problem
-The original implementation estimated tokens using a simple formula:
-```python
-max_chars = max_tokens * 4
-```
-
-This assumes ~4 characters per token, which is:
- Inaccurate for many languages and content types
- Can lead to over-injection (exceeding token limits)
- Can lead to under-injection (wasting available budget)
-
-### Solution
-The new implementation uses **tiktoken**, OpenAI's official tokenizer library, to count tokens accurately:
+Function today:

 ```python
-import tiktoken
-
-def _count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
-    encoding = tiktoken.get_encoding(encoding_name)
-    return len(encoding.encode(text))
+def format_memory_for_injection(memory_data: dict[str, Any], max_tokens: int = 2000) -> str:
 ```

- Uses `cl100k_base` encoding (GPT-4, GPT-3.5, text-embedding-ada-002)
- Provides exact token counts for budget management
- Falls back to character-based estimation if tiktoken fails
+Current injection format:
+- `User Context` section from `user.*.summary`
+- `History` section from `history.*.summary`
+- `Facts` section from `facts[]`, sorted by confidence, appended until token budget is reached

-### Benefits
- **Precision**: Exact token counts match what the model sees
- **Budget Optimization**: Maximizes use of available token budget
- **No Overflows**: Prevents exceeding `max_injection_tokens` limit
- **Better Planning**: Each section's token cost is known precisely
+Token counting:
+- Uses `tiktoken` (`cl100k_base`) when available
+- Falls back to `len(text) // 4` if tokenizer import fails

-### Example
-```python
-text = "This is a test string to count tokens accurately using tiktoken."
+## Known Gap

-# Old method
-char_count = len(text)  # 64 characters
-old_estimate = char_count // 4  # 16 tokens (overestimate)
+Previous versions of this document described TF-IDF/context-aware retrieval as if it were already shipped.
+That was not accurate for `main` and caused confusion.

-# New method
-accurate_count = _count_tokens(text)  # 13 tokens (exact)
+Issue reference: `#1059`
+
+## Roadmap (Planned)
+
+Planned scoring strategy:
+
+```text
+final_score = (similarity * 0.6) + (confidence * 0.4)
 ```

-**Result**: 3-token difference (18.75% error rate)
+Planned integration shape:
+1. Extract recent conversational context from filtered user/final-assistant turns.
+2. Compute TF-IDF cosine similarity between each fact and current context.
+3. Rank by weighted score and inject under token budget.
+4. Fall back to confidence-only ranking if context is unavailable.

-In production, errors can be much larger for:
- Code snippets (more tokens per character)
- Non-English text (variable token ratios)
- Technical jargon (often multi-token words)
+## Validation

-## Implementation Details
+Current regression coverage includes:
+- facts inclusion in memory injection output
+- confidence ordering
+- token-budget-limited fact inclusion

-### Function Signature
-```python
-def format_memory_for_injection(
-    memory_data: dict[str, Any],
-    max_tokens: int = 2000,
-    current_context: str | None = None,
-) -> str:
-```
-
-**New Parameter**:
- `current_context`: Optional string containing recent conversation messages for similarity calculation
-
-### Backward Compatibility
-The function remains **100% backward compatible**:
- If `current_context` is `None` or empty, falls back to confidence-only ranking
- Existing callers without the parameter work exactly as before
- Token counting is always accurate (transparent improvement)
-
-### Integration Point
-Memory is **dynamically injected** via `MemoryMiddleware.before_model()`:
-
-```python
-# src/agents/middlewares/memory_middleware.py
-
-def _extract_conversation_context(messages: list, max_turns: int = 3) -> str:
-    """Extract recent conversation (user input + final responses only)."""
-    context_parts = []
-    turn_count = 0
-
-    for msg in reversed(messages):
-        if msg.type == "human":
-            # Always include user messages
-            context_parts.append(extract_text(msg))
-            turn_count += 1
-            if turn_count >= max_turns:
-                break
-
-        elif msg.type == "ai" and not msg.tool_calls:
-            # Only include final AI responses (no tool_calls)
-            context_parts.append(extract_text(msg))
-
-        # Skip tool messages and AI messages with tool_calls
-
-    return " ".join(reversed(context_parts))
-
-
-class MemoryMiddleware:
-    def before_model(self, state, runtime):
-        """Inject memory before EACH LLM call (not just before_agent)."""
-
-        # Get recent conversation context (filtered)
-        conversation_context = _extract_conversation_context(
-            state["messages"],
-            max_turns=3
-        )
-
-        # Load memory with context-aware fact selection
-        memory_data = get_memory_data()
-        memory_content = format_memory_for_injection(
-            memory_data,
-            max_tokens=config.max_injection_tokens,
-            current_context=conversation_context,  # ✅ Clean conversation only
-        )
-
-        # Inject as system message
-        memory_message = SystemMessage(
-            content=f"<memory>\n{memory_content}\n</memory>",
-            name="memory_context",
-        )
-
-        return {"messages": [memory_message] + state["messages"]}
-```
-
-### How It Works
-
-1. **User continues conversation**:
-   ```
-   Turn 1: "I'm working on a Python project"
-   Turn 2: "It uses FastAPI and SQLAlchemy"
-   Turn 3: "How do I write tests?"  ← Current query
-   ```
-
-2. **Extract recent context**: Last 3 turns combined:
-   ```
-   "I'm working on a Python project. It uses FastAPI and SQLAlchemy. How do I write tests?"
-   ```
-
-3. **TF-IDF scoring**: Ranks facts by relevance to this context
-   - High score: "Prefers pytest for testing" (testing + Python)
-   - High score: "Likes type hints in Python" (Python related)
-   - High score: "Expert in Python and FastAPI" (Python + FastAPI)
-   - Low score: "Uses Docker for containerization" (less relevant)
-
-4. **Injection**: Top-ranked facts injected into system prompt's `<memory>` section
-
-5. **Agent sees**: Full system prompt with relevant memory context
-
-### Benefits of Dynamic System Prompt
-
- **Multi-Turn Context**: Uses last 3 turns, not just current question
-  - Captures ongoing conversation flow
-  - Better understanding of user's current focus
- **Query-Specific Facts**: Different facts surface based on conversation topic
- **Clean Architecture**: No middleware message manipulation
- **LangChain Native**: Uses built-in dynamic system prompt support
- **Runtime Flexibility**: Memory regenerated for each agent invocation
-
-## Dependencies
-
-New dependencies added to `pyproject.toml`:
-```toml
-dependencies = [
-    # ... existing dependencies ...
-    "tiktoken>=0.8.0",      # Accurate token counting
-    "scikit-learn>=1.6.1",  # TF-IDF vectorization
-]
-```
-
-Install with:
-```bash
-cd backend
-uv sync
-```
-
-## Testing
-
-Run the test script to verify improvements:
-```bash
-cd backend
-python test_memory_improvement.py
-```
-
-Expected output shows:
- Different fact ordering based on context
- Accurate token counts vs old estimates
- Budget-respecting fact selection
-
-## Performance Impact
-
-### Computational Cost
- **TF-IDF Calculation**: O(n × m) where n=facts, m=vocabulary
-  - Negligible for typical fact counts (10-100 facts)
-  - Caching opportunities if context doesn't change
- **Token Counting**: ~10-100µs per call
-  - Faster than the old character-counting approach
-  - Minimal overhead compared to LLM inference
-
-### Memory Usage
- **TF-IDF Vectorizer**: ~1-5MB for typical vocabulary
-  - Instantiated once per injection call
-  - Garbage collected after use
- **Tiktoken Encoding**: ~1MB (cached singleton)
-  - Loaded once per process lifetime
-
-### Recommendations
- Current implementation is optimized for accuracy over caching
- For high-throughput scenarios, consider:
-  - Pre-computing fact embeddings (store in memory.json)
-  - Caching TF-IDF vectorizer between calls
-  - Using approximate nearest neighbor search for >1000 facts
-
-## Summary
-
-| Aspect | Before | After |
-|--------|--------|-------|
-| Fact Selection | Top 15 by confidence only | Relevance-based (similarity + confidence) |
-| Token Counting | `len(text) // 4` | `tiktoken.encode(text)` |
-| Context Awareness | None | TF-IDF cosine similarity |
-| Accuracy | ±25% token estimate | Exact token count |
-| Configuration | Fixed weights | Customizable similarity/confidence weights |
-
-These improvements result in:
- **More relevant** facts injected into context
- **Better utilization** of available token budget
- **Fewer hallucinations** due to focused context
- **Higher quality** agent responses
+Tests:
+- `backend/tests/test_memory_prompt_injection.py`