* feat(eval): add report quality evaluation module
Addresses issue #773 - How to evaluate generated report quality objectively.
This module provides two evaluation approaches:
1. Automated metrics (no LLM required; sketched after this list):
- Citation count and source diversity
- Word count compliance per report style
- Section structure validation
- Image inclusion tracking
2. LLM-as-Judge evaluation:
- Factual accuracy scoring
- Completeness assessment
- Coherence evaluation
- Relevance and citation quality checks
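For reference, a minimal sketch of the metrics-only checks; the helper names and the citation regex below are illustrative assumptions, not the actual API of src/eval/metrics.py:

```python
import re
from urllib.parse import urlparse

# Illustrative citation pattern: markdown links of the form [title](https://...).
CITATION_PATTERN = re.compile(r"\[.+?\]\((https?://[^)]+)\)")

def count_citations(report: str) -> int:
    """Number of markdown-style citation links found in the report."""
    return len(CITATION_PATTERN.findall(report))

def source_diversity(report: str) -> float:
    """Unique citation domains divided by total citations (0.0 when none)."""
    urls = CITATION_PATTERN.findall(report)
    if not urls:
        return 0.0
    domains = {urlparse(url).netloc for url in urls}
    return len(domains) / len(urls)

def word_count_compliance(report: str, target_min: int, target_max: int) -> bool:
    """Whether the report length falls inside the style's target word range."""
    words = len(report.split())
    return target_min <= words <= target_max
```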
The combined evaluator provides a final score (1-10) and letter grade (A+ to F).
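The score-to-grade mapping lives in src/eval/evaluator.py; the cutoffs below are assumed for illustration only:

```python
def score_to_grade(score: float) -> str:
    """Map a 1-10 combined score to a letter grade (A+ to F).

    The thresholds here are assumptions, not the evaluator's actual cutoffs.
    """
    cutoffs = [
        (9.5, "A+"), (9.0, "A"), (8.5, "A-"),
        (8.0, "B+"), (7.5, "B"), (7.0, "B-"),
        (6.5, "C+"), (6.0, "C"), (5.5, "C-"),
        (5.0, "D"),
    ]
    for threshold, grade in cutoffs:
        if score >= threshold:
            return grade
    return "F"
```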
Files added:
- src/eval/__init__.py
- src/eval/metrics.py
- src/eval/llm_judge.py
- src/eval/evaluator.py
- tests/unit/eval/test_metrics.py
- tests/unit/eval/test_evaluator.py
* feat(eval): integrate report evaluation with web UI
This commit adds the web UI integration for the evaluation module:
Backend:
- Add EvaluateReportRequest/Response models in src/server/eval_request.py
- Add /api/report/evaluate endpoint to src/server/app.py
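A minimal sketch of the backend wiring, assuming field names and placeholder scoring logic that may differ from src/server/eval_request.py and src/server/app.py:

```python
from fastapi import FastAPI
from pydantic import BaseModel

# Field names are assumptions for illustration; see src/server/eval_request.py
# for the actual models.
class EvaluateReportRequest(BaseModel):
    report_content: str
    report_style: str = "academic"
    use_llm_judge: bool = False  # metrics-only evaluation by default

class EvaluateReportResponse(BaseModel):
    score: float   # combined 1-10 score
    grade: str     # "A+" .. "F"
    metrics: dict  # citation count, word count compliance, etc.

app = FastAPI()

@app.post("/api/report/evaluate", response_model=EvaluateReportResponse)
async def evaluate_report(req: EvaluateReportRequest) -> EvaluateReportResponse:
    # Placeholder logic; the real endpoint delegates to the combined evaluator
    # in src/eval/evaluator.py.
    word_count = len(req.report_content.split())
    return EvaluateReportResponse(
        score=7.0 if word_count else 1.0,
        grade="B-" if word_count else "F",
        metrics={"word_count": word_count},
    )
```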
Frontend:
- Add evaluateReport API function in web/src/core/api/evaluate.ts
- Create EvaluationDialog component with grade badge, metrics display,
and optional LLM deep evaluation
- Add evaluation button (graduation cap icon) to research-block.tsx toolbar
- Add i18n translations for English and Chinese
The evaluation UI allows users to:
1. View quick metrics-only evaluation (instant)
2. Optionally run deep LLM-based evaluation for detailed analysis
3. See grade (A+ to F), score (1-10), and metric breakdown
* feat(eval): improve evaluation reliability and add LLM judge tests
- Extract MAX_REPORT_LENGTH constant in llm_judge.py for maintainability (see the sketch after this list)
- Add comprehensive unit tests for LLMJudge class (parse_response,
calculate_weighted_score, evaluate with mocked LLM)
- Pass reportStyle prop to EvaluationDialog for accurate evaluation criteria
- Add researchQueries store map to reliably associate queries with research
- Add getResearchQuery helper to retrieve query by researchId
- Remove unused imports in test_metrics.py
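For context on the first two bullets, a sketch of the weighted-score idea; the dimension weights and the MAX_REPORT_LENGTH value are assumptions rather than the constants in llm_judge.py:

```python
# Assumed cap; the actual constant is defined in src/eval/llm_judge.py.
MAX_REPORT_LENGTH = 20_000

# Assumed weights over the judge's scoring dimensions.
DIMENSION_WEIGHTS = {
    "factual_accuracy": 0.30,
    "completeness": 0.25,
    "coherence": 0.20,
    "relevance": 0.15,
    "citation_quality": 0.10,
}

def truncate_report(report: str) -> str:
    """Cap the report text sent to the judge so the prompt stays bounded."""
    return report[:MAX_REPORT_LENGTH]

def calculate_weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension 1-10 scores into a single weighted score."""
    return sum(weight * scores.get(dim, 0.0)
               for dim, weight in DIMENSION_WEIGHTS.items())
```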
* fix(eval): use resolveServiceURL for evaluate API endpoint
The evaluateReport function was using a relative URL '/api/report/evaluate'
which sent requests to the Next.js server instead of the FastAPI backend.
Changed it to use resolveServiceURL(), consistent with other API functions.
* fix: improve type accuracy and React hooks in evaluation components
- Fix get_word_count_target return type from Optional[Dict] to Dict since it always returns a value via its default fallback (sketched below)
- Fix useEffect dependency issue in EvaluationDialog using useRef to prevent unwanted re-evaluations
- Add aria-label to GradeBadge for screen reader accessibility
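A minimal sketch of the default-fallback behavior behind the first fix, with word-count targets assumed rather than taken from metrics.py:

```python
from typing import Dict

# Illustrative per-style targets; the real values live in src/eval/metrics.py.
_WORD_COUNT_TARGETS: Dict[str, Dict[str, int]] = {
    "academic": {"min": 1500, "max": 4000},
    "news": {"min": 600, "max": 1500},
}
_DEFAULT_TARGET: Dict[str, int] = {"min": 800, "max": 2500}

def get_word_count_target(report_style: str) -> Dict[str, int]:
    # The default fallback guarantees a value, so the return type is Dict,
    # not Optional[Dict].
    return _WORD_COUNT_TARGETS.get(report_style, _DEFAULT_TARGET)
```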