mirror of
https://gitee.com/wanwujie/deer-flow
synced 2026-04-25 23:14:46 +08:00
refactor: extract shared skill installer and upload manager to harness (#1202)
* refactor: extract shared skill installer and upload manager to harness

  Move duplicated business logic from Gateway routers and Client into shared harness modules, eliminating code duplication.

  New shared modules:
  - deerflow.skills.installer: 6 functions (zip security, extraction, install)
  - deerflow.uploads.manager: 7 functions (normalize, deduplicate, validate, list, delete, get_uploads_dir, ensure_uploads_dir)

  Key improvements:
  - SkillAlreadyExistsError replaces stringly-typed 409 status routing
  - normalize_filename rejects backslash-containing filenames
  - Read paths (list/delete) no longer mkdir via get_uploads_dir
  - Write paths use ensure_uploads_dir for explicit directory creation
  - list_files_in_dir does stat inside scandir context (no re-stat)
  - install_skill_from_archive uses single is_file() check (one syscall)
  - Fix agent config key not reset on update_mcp_config/update_skill

  Tests: 42 new (22 installer + 20 upload manager) + client hardening

* refactor: centralize upload URL construction and clean up installer

  - Extract upload_virtual_path(), upload_artifact_url(), enrich_file_listing() into shared manager.py, eliminating 6 duplicated URL constructions across Gateway router and Client
  - Derive all upload URLs from VIRTUAL_PATH_PREFIX constant instead of hardcoded "mnt/user-data/uploads" strings
  - Eliminate TOCTOU pre-checks and double file read in installer — single ZipFile() open with exception handling replaces is_file() + is_zipfile() + ZipFile() sequence
  - Add missing re-exports: ensure_uploads_dir in uploads/__init__.py, SkillAlreadyExistsError in skills/__init__.py
  - Remove redundant .lower() on already-lowercase CONVERTIBLE_EXTENSIONS
  - Hoist sandbox_uploads_dir(thread_id) before loop in uploads router

* fix: add input validation for thread_id and filename length

  - Reject thread_id containing unsafe filesystem characters (only allow alphanumeric, hyphens, underscores, dots) — prevents 500 on inputs like <script> or shell metacharacters
  - Reject filenames longer than 255 bytes (OS limit) in normalize_filename
  - Gateway upload router maps ValueError to 400 for invalid thread_id

* fix: address PR review — symlink safety, input validation coverage, error ordering

  - list_files_in_dir: use follow_symlinks=False to prevent symlink metadata leakage; check is_dir() instead of exists() for non-directory paths
  - install_skill_from_archive: restore is_file() pre-check before extension validation so error messages match the documented exception contract
  - validate_thread_id: move from ensure_uploads_dir to get_uploads_dir so all entry points (upload/list/delete) are protected
  - delete_uploaded_file: catch ValueError from thread_id validation (was 500)
  - requires_llm marker: also skip when OPENAI_API_KEY is unset
  - e2e fixture: update TitleMiddleware exclusion comment (kept filtering — middleware triggers extra LLM calls that add non-determinism to tests)

* chore: revert uv.lock to main — no dependency changes in this PR

* fix: use monkeypatch for global config in e2e fixture to prevent test pollution

  The e2e_env fixture was calling set_title_config() and set_summarization_config() directly, which mutated global singletons without automatic cleanup. When pytest ran test_client_e2e.py before test_title_middleware_core_logic.py, the leaked enabled=False caused 5 title tests to fail in CI. Switched to monkeypatch.setattr on the module-level private variables so pytest restores the originals after each test.

* fix: address code review — URL encoding, API consistency, test isolation

  - upload_artifact_url: percent-encode filename to handle spaces/#/?
  - deduplicate_filename: mutate seen set in place (caller no longer needs manual .add() — less error-prone API)
  - list_files_in_dir: document that size is int, enrich stringifies
  - e2e fixture: monkeypatch _app_config instead of set_app_config() to prevent global singleton pollution (same pattern as title/summarization fix)
  - _make_e2e_config: read LLM connection details from env vars so external contributors can override defaults
  - Update tests to match new deduplicate_filename contract

* docs: rewrite RFC in English and add alternatives/breaking changes sections

* fix: address code review feedback on PR #1202

  - Rename deduplicate_filename to claim_unique_filename to make the in-place set mutation explicit in the function name
  - Replace PermissionError with PathTraversalError(ValueError) for path traversal detection — malformed input is 400, not 403

* fix: set _app_config_is_custom in e2e test fixture to prevent config.yaml lookup in CI

---------

Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
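The claim_unique_filename rename above makes the in-place set mutation explicit in the name. The harness module's source is not part of this diff, so the following is only a sketch of the described contract; the " (n)" collision suffix is an assumption, the real module may use a different scheme.

```python
from pathlib import PurePosixPath


def claim_unique_filename(filename: str, seen: set[str]) -> str:
    """Return a filename not yet in `seen`, recording the claim in place.

    Because the claimed name is added to `seen` here, the caller no longer
    needs a manual `seen.add()` after the call (the less error-prone API
    described in the commit message).
    """
    if filename not in seen:
        seen.add(filename)
        return filename
    stem = PurePosixPath(filename).stem
    suffix = PurePosixPath(filename).suffix
    counter = 1
    while f"{stem} ({counter}){suffix}" in seen:
        counter += 1
    claimed = f"{stem} ({counter}){suffix}"
    seen.add(claimed)
    return claimed
```

Usage: calling it twice with the same name yields "a.txt" then "a (1).txt", and both names end up in the set.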
@@ -19,12 +19,9 @@ import asyncio
 import json
 import logging
 import mimetypes
 import os
-import re
 import shutil
-import tempfile
 import uuid
-import zipfile
 from collections.abc import Generator
 from dataclasses import dataclass, field
 from pathlib import Path
@@ -42,6 +39,17 @@ from deerflow.config.app_config import get_app_config, reload_app_config
 from deerflow.config.extensions_config import ExtensionsConfig, SkillStateConfig, get_extensions_config, reload_extensions_config
 from deerflow.config.paths import get_paths
 from deerflow.models import create_chat_model
+from deerflow.skills.installer import install_skill_from_archive
+from deerflow.uploads.manager import (
+    claim_unique_filename,
+    delete_file_safe,
+    enrich_file_listing,
+    ensure_uploads_dir,
+    get_uploads_dir,
+    list_files_in_dir,
+    upload_artifact_url,
+    upload_virtual_path,
+)

 logger = logging.getLogger(__name__)

@@ -566,6 +574,7 @@ class DeerFlowClient:
         self._atomic_write_json(config_path, config_data)

         self._agent = None
+        self._agent_config_key = None
         reloaded = reload_extensions_config()
         return {"mcp_servers": {name: server.model_dump() for name, server in reloaded.mcp_servers.items()}}

@@ -631,6 +640,7 @@ class DeerFlowClient:
         self._atomic_write_json(config_path, config_data)

         self._agent = None
+        self._agent_config_key = None
         reload_extensions_config()

         updated = next((s for s in load_skills(enabled_only=False) if s.name == name), None)
@@ -657,56 +667,7 @@ class DeerFlowClient:
             FileNotFoundError: If the file does not exist.
             ValueError: If the file is invalid.
         """
-        from deerflow.skills.loader import get_skills_root_path
-        from deerflow.skills.validation import _validate_skill_frontmatter
-
-        path = Path(skill_path)
-        if not path.exists():
-            raise FileNotFoundError(f"Skill file not found: {skill_path}")
-        if not path.is_file():
-            raise ValueError(f"Path is not a file: {skill_path}")
-        if path.suffix != ".skill":
-            raise ValueError("File must have .skill extension")
-        if not zipfile.is_zipfile(path):
-            raise ValueError("File is not a valid ZIP archive")
-
-        skills_root = get_skills_root_path()
-        custom_dir = skills_root / "custom"
-        custom_dir.mkdir(parents=True, exist_ok=True)
-
-        with tempfile.TemporaryDirectory() as tmp:
-            tmp_path = Path(tmp)
-            with zipfile.ZipFile(path, "r") as zf:
-                total_size = sum(info.file_size for info in zf.infolist())
-                if total_size > 100 * 1024 * 1024:
-                    raise ValueError("Skill archive too large when extracted (>100MB)")
-                for info in zf.infolist():
-                    if Path(info.filename).is_absolute() or ".." in Path(info.filename).parts:
-                        raise ValueError(f"Unsafe path in archive: {info.filename}")
-                zf.extractall(tmp_path)
-            for p in tmp_path.rglob("*"):
-                if p.is_symlink():
-                    p.unlink()
-
-            items = list(tmp_path.iterdir())
-            if not items:
-                raise ValueError("Skill archive is empty")
-
-            skill_dir = items[0] if len(items) == 1 and items[0].is_dir() else tmp_path
-
-            is_valid, message, skill_name = _validate_skill_frontmatter(skill_dir)
-            if not is_valid:
-                raise ValueError(f"Invalid skill: {message}")
-            if not re.fullmatch(r"[a-zA-Z0-9_-]+", skill_name):
-                raise ValueError(f"Invalid skill name: {skill_name}")
-
-            target = custom_dir / skill_name
-            if target.exists():
-                raise ValueError(f"Skill '{skill_name}' already exists")
-
-            shutil.copytree(skill_dir, target)
-
-            return {"success": True, "skill_name": skill_name, "message": f"Skill '{skill_name}' installed successfully"}
+        return install_skill_from_archive(skill_path)

     # ------------------------------------------------------------------
     # Public API — memory management
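The commit message says the extracted installer replaces the is_file() + is_zipfile() + ZipFile() sequence above with a single ZipFile() open guarded by exception handling, closing the TOCTOU window between the check and the open. A minimal sketch of that pattern, with a hypothetical helper name and error messages borrowed from the removed code (the harness source itself is not in this diff):

```python
import zipfile
from pathlib import Path

# Mirrors the 100 MB extracted-size limit enforced by the removed code.
MAX_EXTRACTED_SIZE = 100 * 1024 * 1024


def open_skill_archive(path: Path) -> zipfile.ZipFile:
    """Open and size-check a skill archive with a single ZipFile() call.

    Catching BadZipFile instead of pre-checking is_zipfile() means the file
    is opened exactly once: there is no window in which it can be swapped
    between a validity check and the real open, and no double read.
    """
    try:
        zf = zipfile.ZipFile(path, "r")
    except (FileNotFoundError, IsADirectoryError):
        raise FileNotFoundError(f"Skill file not found: {path}") from None
    except zipfile.BadZipFile as exc:
        raise ValueError("File is not a valid ZIP archive") from exc
    total_size = sum(info.file_size for info in zf.infolist())
    if total_size > MAX_EXTRACTED_SIZE:
        zf.close()
        raise ValueError("Skill archive too large when extracted (>100MB)")
    return zf
```

The caller is responsible for closing the returned handle (or using it as a context manager), which is what lets the same open serve both validation and extraction.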
@@ -756,13 +717,6 @@ class DeerFlowClient:
     # Public API — file uploads
     # ------------------------------------------------------------------

-    @staticmethod
-    def _get_uploads_dir(thread_id: str) -> Path:
-        """Get (and create) the uploads directory for a thread."""
-        base = get_paths().sandbox_uploads_dir(thread_id)
-        base.mkdir(parents=True, exist_ok=True)
-        return base
-
     def upload_files(self, thread_id: str, files: list[str | Path]) -> dict:
         """Upload local files into a thread's uploads directory.

@@ -784,7 +738,7 @@ class DeerFlowClient:

         # Validate all files upfront to avoid partial uploads.
         resolved_files = []
-        convertible_extensions = {ext.lower() for ext in CONVERTIBLE_EXTENSIONS}
+        seen_names: set[str] = set()
         has_convertible_file = False
         for f in files:
             p = Path(f)
@@ -792,11 +746,12 @@ class DeerFlowClient:
                 raise FileNotFoundError(f"File not found: {f}")
             if not p.is_file():
                 raise ValueError(f"Path is not a file: {f}")
-            resolved_files.append(p)
-            if not has_convertible_file and p.suffix.lower() in convertible_extensions:
+            dest_name = claim_unique_filename(p.name, seen_names)
+            resolved_files.append((p, dest_name))
+            if not has_convertible_file and p.suffix.lower() in CONVERTIBLE_EXTENSIONS:
                 has_convertible_file = True

-        uploads_dir = self._get_uploads_dir(thread_id)
+        uploads_dir = ensure_uploads_dir(thread_id)
         uploaded_files: list[dict] = []

         conversion_pool = None
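The ensure_uploads_dir call above is the write path, which per the commit message creates the directory explicitly and validates the thread_id (only alphanumerics, hyphens, underscores, and dots, so inputs like "&lt;script&gt;" fail with ValueError, mapped to 400 by the Gateway, instead of a 500). A sketch of that rule under stated assumptions: the real helper derives the base directory from get_paths() and takes only thread_id, while this illustration takes the base as a parameter to stay self-contained.

```python
import re
from pathlib import Path

# Charset from the commit message: alphanumerics, hyphens, underscores, dots.
_THREAD_ID_RE = re.compile(r"[A-Za-z0-9._-]+")


def validate_thread_id(thread_id: str) -> str:
    """Reject thread IDs containing unsafe filesystem characters."""
    if not _THREAD_ID_RE.fullmatch(thread_id):
        raise ValueError(f"Invalid thread_id: {thread_id!r}")
    return thread_id


def ensure_uploads_dir(base: Path, thread_id: str) -> Path:
    """Write-path helper: validate the ID, then create the directory."""
    uploads = base / validate_thread_id(thread_id) / "uploads"
    uploads.mkdir(parents=True, exist_ok=True)
    return uploads
```

The read-path counterpart (get_uploads_dir) runs the same validation but, per the commit message, never calls mkdir.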
@@ -816,19 +771,21 @@ class DeerFlowClient:
                 return asyncio.run(convert_file_to_markdown(path))

         try:
-            for src_path in resolved_files:
-                dest = uploads_dir / src_path.name
+            for src_path, dest_name in resolved_files:
+                dest = uploads_dir / dest_name
                 shutil.copy2(src_path, dest)

                 info: dict[str, Any] = {
-                    "filename": src_path.name,
+                    "filename": dest_name,
                     "size": str(dest.stat().st_size),
                     "path": str(dest),
-                    "virtual_path": f"/mnt/user-data/uploads/{src_path.name}",
-                    "artifact_url": f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{src_path.name}",
+                    "virtual_path": upload_virtual_path(dest_name),
+                    "artifact_url": upload_artifact_url(thread_id, dest_name),
                 }
+                if dest_name != src_path.name:
+                    info["original_filename"] = src_path.name

-                if src_path.suffix.lower() in convertible_extensions:
+                if src_path.suffix.lower() in CONVERTIBLE_EXTENSIONS:
                     try:
                         if conversion_pool is not None:
                             md_path = conversion_pool.submit(_convert_in_thread, dest).result()
@@ -844,8 +801,9 @@ class DeerFlowClient:

                     if md_path is not None:
                         info["markdown_file"] = md_path.name
-                        info["markdown_virtual_path"] = f"/mnt/user-data/uploads/{md_path.name}"
-                        info["markdown_artifact_url"] = f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{md_path.name}"
+                        info["markdown_path"] = str(uploads_dir / md_path.name)
+                        info["markdown_virtual_path"] = upload_virtual_path(md_path.name)
+                        info["markdown_artifact_url"] = upload_artifact_url(thread_id, md_path.name)

             uploaded_files.append(info)
         finally:
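The upload_virtual_path/upload_artifact_url helpers used above replace the six hardcoded "mnt/user-data/uploads" f-strings this PR removes. Their signatures are inferred from the call sites in this diff, and the percent-encoding detail comes from the commit message ("percent-encode filename to handle spaces/#/?"); the bodies below are a sketch, not the harness source.

```python
from urllib.parse import quote

# Illustrative constant; the harness derives both URLs from a shared
# VIRTUAL_PATH_PREFIX instead of repeating the literal string.
VIRTUAL_PATH_PREFIX = "mnt/user-data/uploads"


def upload_virtual_path(filename: str) -> str:
    """Sandbox-side virtual path for an uploaded file (a filesystem path,
    so no URL encoding is applied)."""
    return f"/{VIRTUAL_PATH_PREFIX}/{filename}"


def upload_artifact_url(thread_id: str, filename: str) -> str:
    """Gateway artifact URL; the filename is percent-encoded so names
    containing spaces, '#', or '?' survive URL parsing."""
    return f"/api/threads/{thread_id}/artifacts/{VIRTUAL_PATH_PREFIX}/{quote(filename)}"
```

Deriving both from one constant means a future relocation of the uploads directory is a one-line change instead of a hunt for duplicated literals.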
@@ -868,29 +826,9 @@ class DeerFlowClient:
            Dict with "files" and "count" keys, matching the Gateway API
            ``list_uploaded_files`` response.
        """
-        uploads_dir = self._get_uploads_dir(thread_id)
-        if not uploads_dir.exists():
-            return {"files": [], "count": 0}
-
-        files = []
-        with os.scandir(uploads_dir) as entries:
-            file_entries = [entry for entry in entries if entry.is_file()]
-
-        for entry in sorted(file_entries, key=lambda item: item.name):
-            stat = entry.stat()
-            filename = entry.name
-            files.append(
-                {
-                    "filename": filename,
-                    "size": str(stat.st_size),
-                    "path": str(Path(entry.path)),
-                    "virtual_path": f"/mnt/user-data/uploads/{filename}",
-                    "artifact_url": f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{filename}",
-                    "extension": Path(filename).suffix,
-                    "modified": stat.st_mtime,
-                }
-            )
-        return {"files": files, "count": len(files)}
+        uploads_dir = get_uploads_dir(thread_id)
+        result = list_files_in_dir(uploads_dir)
+        return enrich_file_listing(result, thread_id)

     def delete_upload(self, thread_id: str, filename: str) -> dict:
         """Delete a file from a thread's uploads directory.
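The shared list_files_in_dir helper the hunk above delegates to is described in the commit message as stat-ing inside the scandir context (the removed code above stat-ed after the context closed, forcing a re-stat) and as using follow_symlinks=False so a symlink planted in the uploads directory cannot leak metadata about its target. A sketch of that behavior, with fields and ordering assumed from the removed Client code (per the commit message, size stays an int here and enrich_file_listing stringifies it):

```python
import os
from pathlib import Path


def list_files_in_dir(directory: Path) -> list[dict]:
    """List regular files with one stat() per entry, taken inside scandir."""
    # is_dir() rather than exists(): a non-directory path is treated the
    # same as a missing one, per the review fix in the commit message.
    if not directory.is_dir():
        return []
    files = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # follow_symlinks=False: never report a symlink's target.
            if not entry.is_file(follow_symlinks=False):
                continue
            stat = entry.stat(follow_symlinks=False)
            files.append(
                {
                    "filename": entry.name,
                    "size": stat.st_size,  # int; enrichment stringifies
                    "path": entry.path,
                    "extension": Path(entry.name).suffix,
                    "modified": stat.st_mtime,
                }
            )
    files.sort(key=lambda item: item["filename"])
    return files
```

Thread-specific fields (virtual_path, artifact_url) are deliberately absent: enrich_file_listing adds them, keeping this helper reusable outside the uploads context.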
@@ -907,19 +845,10 @@ class DeerFlowClient:
             FileNotFoundError: If the file does not exist.
             PermissionError: If path traversal is detected.
         """
-        uploads_dir = self._get_uploads_dir(thread_id)
-        file_path = (uploads_dir / filename).resolve()
-
-        try:
-            file_path.relative_to(uploads_dir.resolve())
-        except ValueError as exc:
-            raise PermissionError("Access denied: path traversal detected") from exc
-
-        if not file_path.is_file():
-            raise FileNotFoundError(f"File not found: {filename}")
-
-        file_path.unlink()
-        return {"success": True, "message": f"Deleted {filename}"}
+        from deerflow.utils.file_conversion import CONVERTIBLE_EXTENSIONS
+
+        uploads_dir = get_uploads_dir(thread_id)
+        return delete_file_safe(uploads_dir, filename, convertible_extensions=CONVERTIBLE_EXTENSIONS)

     # ------------------------------------------------------------------
     # Public API — artifacts
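The delete_file_safe helper above inherits the resolve-then-relative_to traversal guard from the removed code, but per the commit message raises PathTraversalError (a ValueError subclass, so the Gateway maps malformed input to 400 rather than 403). A sketch of that guard; the real helper also takes convertible_extensions to clean up derived markdown files, which is omitted here to keep the illustration short.

```python
from pathlib import Path


class PathTraversalError(ValueError):
    """Traversal attempts are malformed input: 400, not PermissionError/403."""


def delete_file_safe(uploads_dir: Path, filename: str) -> dict:
    """Delete filename from uploads_dir, rejecting escapes from the dir."""
    file_path = (uploads_dir / filename).resolve()
    try:
        # resolve() first, so "../" segments and symlinks are flattened
        # before the containment check.
        file_path.relative_to(uploads_dir.resolve())
    except ValueError as exc:
        raise PathTraversalError("Path traversal detected") from exc
    if not file_path.is_file():
        raise FileNotFoundError(f"File not found: {filename}")
    file_path.unlink()
    return {"success": True, "message": f"Deleted {filename}"}
```

A caller that previously caught PermissionError for traversal now catches ValueError (or PathTraversalError specifically), which is the behavior change the PR's breaking-changes section would cover.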
@@ -939,19 +868,13 @@ class DeerFlowClient:
             FileNotFoundError: If the artifact does not exist.
             ValueError: If the path is invalid.
         """
-        virtual_prefix = "mnt/user-data"
-        clean_path = path.lstrip("/")
-        if not clean_path.startswith(virtual_prefix):
-            raise ValueError(f"Path must start with /{virtual_prefix}")
-
-        relative = clean_path[len(virtual_prefix) :].lstrip("/")
-        base_dir = get_paths().sandbox_user_data_dir(thread_id)
-        actual = (base_dir / relative).resolve()
-
         try:
-            actual.relative_to(base_dir.resolve())
+            actual = get_paths().resolve_virtual_path(thread_id, path)
         except ValueError as exc:
-            raise PermissionError("Access denied: path traversal detected") from exc
+            if "traversal" in str(exc):
+                from deerflow.uploads.manager import PathTraversalError
+                raise PathTraversalError("Path traversal detected") from exc
+            raise
         if not actual.exists():
             raise FileNotFoundError(f"Artifact not found: {path}")
         if not actual.is_file():