feat(eval): add report quality evaluation module and UI integration (#776)

* feat(eval): add report quality evaluation module

Addresses issue #773 - How to evaluate generated report quality objectively.

This module provides two evaluation approaches:

1. Automated metrics (no LLM required):
   - Citation count and source diversity
   - Word count compliance per report style
   - Section structure validation
   - Image inclusion tracking

2. LLM-as-Judge evaluation:
   - Factual accuracy scoring
   - Completeness assessment
   - Coherence evaluation
   - Relevance and citation quality checks

The combined evaluator provides a final score (1-10) and letter grade (A+ to F).

Files added:
- src/eval/__init__.py
- src/eval/metrics.py
- src/eval/llm_judge.py
- src/eval/evaluator.py
- tests/unit/eval/test_metrics.py
- tests/unit/eval/test_evaluator.py

* feat(eval): integrate report evaluation with web UI

This commit adds the web UI integration for the evaluation module:

Backend:
- Add EvaluateReportRequest/Response models in src/server/eval_request.py
- Add /api/report/evaluate endpoint to src/server/app.py

Frontend:
- Add evaluateReport API function in web/src/core/api/evaluate.ts
- Create EvaluationDialog component with grade badge, metrics display, and optional LLM deep evaluation
- Add evaluation button (graduation cap icon) to research-block.tsx toolbar
- Add i18n translations for English and Chinese

The evaluation UI allows users to:
1. View quick metrics-only evaluation (instant)
2. Optionally run deep LLM-based evaluation for detailed analysis
3. See grade (A+ to F), score (1-10), and metric breakdown

* feat(eval): improve evaluation reliability and add LLM judge tests

- Extract MAX_REPORT_LENGTH constant in llm_judge.py for maintainability
- Add comprehensive unit tests for LLMJudge class (parse_response, calculate_weighted_score, evaluate with mocked LLM)
- Pass reportStyle prop to EvaluationDialog for accurate evaluation criteria
- Add researchQueries store map to reliably associate queries with research
- Add getResearchQuery helper to retrieve query by researchId
- Remove unused imports in test_metrics.py

* fix(eval): use resolveServiceURL for evaluate API endpoint

The evaluateReport function was using a relative URL '/api/report/evaluate' which sent requests to the Next.js server instead of the FastAPI backend. Changed to use resolveServiceURL() consistent with other API functions.

* fix: improve type accuracy and React hooks in evaluation components

- Fix get_word_count_target return type from Optional[Dict] to Dict since it always returns a value via default fallback
- Fix useEffect dependency issue in EvaluationDialog using useRef to prevent unwanted re-evaluations
- Add aria-label to GradeBadge for screen reader accessibility
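
For orientation, the automated (no-LLM) metrics and the score-to-grade step described above could look roughly like the sketch below. This is not the code in src/eval/metrics.py or src/eval/evaluator.py: the function names `compute_metrics` and `score_to_grade`, the regular expressions, and the grade cutoffs are all assumptions made for illustration. The commit message only states that scores run from 1 to 10 and grades from A+ to F.

```python
import re
from urllib.parse import urlparse

# Markdown links that are NOT images: [text](http...). Assumed citation format.
LINK_RE = re.compile(r"(?<!!)\[[^\]]*\]\((https?://[^)\s]+)\)")
# Markdown images: ![alt](url).
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]+\)")

# Hypothetical cutoffs, checked from highest to lowest; not taken from the commit.
GRADE_CUTOFFS = [(9.5, "A+"), (9.0, "A"), (8.0, "B"), (7.0, "C"), (6.0, "D")]


def compute_metrics(report: str) -> dict:
    """Count words, citations, unique source domains, and images in a Markdown report."""
    urls = LINK_RE.findall(report)
    return {
        "word_count": len(report.split()),
        "citations": len(urls),
        # Source diversity approximated as the number of distinct host names.
        "unique_sources": len({urlparse(u).netloc for u in urls}),
        "images": len(IMAGE_RE.findall(report)),
    }


def score_to_grade(score: float) -> str:
    """Map a 1-10 score to a letter grade using the assumed cutoffs above."""
    for cutoff, grade in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"


sample = (
    "# Title\n\n"
    "See [a](https://example.com/x) and [b](https://example.com/y) "
    "and [c](https://other.org/z).\n\n"
    "![chart](https://img.example.com/c.png)"
)
metrics = compute_metrics(sample)
```

Counting distinct host names is only one way to approximate source diversity; the real module may define it differently.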
2026-04-20 21:04:45 +08:00 · 2025-12-25 21:55:48 +08:00
parent 84a7f7815c
commit 8d9d767051
17 changed files with 2103 additions and 2 deletions
--- a/web/messages/en.json
+++ b/web/messages/en.json
@@ -150,6 +150,7 @@
      "downloadWord": "Word (.docx)",
      "downloadImage": "Image (.png)",
      "exportFailed": "Export failed, please try again",
+      "evaluateReport": "Evaluate report quality",
      "searchingFor": "Searching for",
      "reading": "Reading",
      "runningPythonCode": "Running Python code",
@@ -163,6 +164,31 @@
      "errorGeneratingPodcast": "Error when generating podcast. Please try again.",
      "downloadPodcast": "Download podcast"
    },
+    "evaluation": {
+      "title": "Report Quality Evaluation",
+      "description": "Evaluate your report using automated metrics and AI analysis.",
+      "evaluating": "Evaluating report...",
+      "analyzing": "Running deep analysis...",
+      "overallScore": "Overall Score",
+      "metrics": "Report Metrics",
+      "wordCount": "Word Count",
+      "citations": "Citations",
+      "sources": "Unique Sources",
+      "images": "Images",
+      "sectionCoverage": "Section Coverage",
+      "detailedAnalysis": "Detailed Analysis",
+      "deepEvaluation": "Deep Evaluation (AI)",
+      "strengths": "Strengths",
+      "weaknesses": "Areas for Improvement",
+      "scores": {
+        "factual_accuracy": "Factual Accuracy",
+        "completeness": "Completeness",
+        "coherence": "Coherence",
+        "relevance": "Relevance",
+        "citation_quality": "Citation Quality",
+        "writing_quality": "Writing Quality"
+      }
+    },
    "messages": {
      "replaying": "Replaying",
      "replayDescription": "DeerFlow is now replaying the conversation...",