refactor: extract shared skill installer and upload manager to harness (#1202)

* refactor: extract shared skill installer and upload manager to harness

Move duplicated business logic from Gateway routers and Client into shared harness modules, eliminating code duplication.

New shared modules:
- deerflow.skills.installer: 6 functions (zip security, extraction, install)
- deerflow.uploads.manager: 7 functions (normalize, deduplicate, validate, list, delete, get_uploads_dir, ensure_uploads_dir)

Key improvements:
- SkillAlreadyExistsError replaces stringly-typed 409 status routing
- normalize_filename rejects backslash-containing filenames
- Read paths (list/delete) no longer mkdir via get_uploads_dir
- Write paths use ensure_uploads_dir for explicit directory creation
- list_files_in_dir does stat inside scandir context (no re-stat)
- install_skill_from_archive uses single is_file() check (one syscall)
- Fix agent config key not reset on update_mcp_config/update_skill

Tests: 42 new (22 installer + 20 upload manager) + client hardening

* refactor: centralize upload URL construction and clean up installer

- Extract upload_virtual_path(), upload_artifact_url(), enrich_file_listing() into shared manager.py, eliminating 6 duplicated URL constructions across Gateway router and Client
- Derive all upload URLs from VIRTUAL_PATH_PREFIX constant instead of hardcoded "mnt/user-data/uploads" strings
- Eliminate TOCTOU pre-checks and double file read in installer: a single ZipFile() open with exception handling replaces the is_file() + is_zipfile() + ZipFile() sequence
- Add missing re-exports: ensure_uploads_dir in uploads/__init__.py, SkillAlreadyExistsError in skills/__init__.py
- Remove redundant .lower() on already-lowercase CONVERTIBLE_EXTENSIONS
- Hoist sandbox_uploads_dir(thread_id) before loop in uploads router

* fix: add input validation for thread_id and filename length

- Reject thread_id containing unsafe filesystem characters (only allow alphanumeric, hyphens, underscores, dots); prevents 500 on inputs like <script> or shell metacharacters
- Reject filenames longer than 255 bytes (OS limit) in normalize_filename
- Gateway upload router maps ValueError to 400 for invalid thread_id

* fix: address PR review (symlink safety, input validation coverage, error ordering)

- list_files_in_dir: use follow_symlinks=False to prevent symlink metadata leakage; check is_dir() instead of exists() for non-directory paths
- install_skill_from_archive: restore is_file() pre-check before extension validation so error messages match the documented exception contract
- validate_thread_id: move from ensure_uploads_dir to get_uploads_dir so all entry points (upload/list/delete) are protected
- delete_uploaded_file: catch ValueError from thread_id validation (was 500)
- requires_llm marker: also skip when OPENAI_API_KEY is unset
- e2e fixture: update TitleMiddleware exclusion comment (kept filtering, since the middleware triggers extra LLM calls that add non-determinism to tests)

* chore: revert uv.lock to main (no dependency changes in this PR)

* fix: use monkeypatch for global config in e2e fixture to prevent test pollution

The e2e_env fixture was calling set_title_config() and set_summarization_config() directly, which mutated global singletons without automatic cleanup. When pytest ran test_client_e2e.py before test_title_middleware_core_logic.py, the leaked enabled=False caused 5 title tests to fail in CI. Switched to monkeypatch.setattr on the module-level private variables so pytest restores the originals after each test.

* fix: address code review (URL encoding, API consistency, test isolation)

- upload_artifact_url: percent-encode filename to handle spaces/#/?
- deduplicate_filename: mutate seen set in place (caller no longer needs manual .add(); less error-prone API)
- list_files_in_dir: document that size is int, enrich stringifies
- e2e fixture: monkeypatch _app_config instead of set_app_config() to prevent global singleton pollution (same pattern as title/summarization fix)
- _make_e2e_config: read LLM connection details from env vars so external contributors can override defaults
- Update tests to match new deduplicate_filename contract

* docs: rewrite RFC in English and add alternatives/breaking changes sections

* fix: address code review feedback on PR #1202

- Rename deduplicate_filename to claim_unique_filename to make the in-place set mutation explicit in the function name
- Replace PermissionError with PathTraversalError(ValueError) for path traversal detection: malformed input is 400, not 403

* fix: set _app_config_is_custom in e2e test fixture to prevent config.yaml lookup in CI

---------

Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
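The upload helpers named in the message can be illustrated with a minimal sketch. The function names and call shapes match how client.py calls them in this diff, but the bodies are assumptions and not the harness code: the value of VIRTUAL_PATH_PREFIX, the `_1`, `_2` suffix scheme for colliding names, and the choice of `urllib.parse.quote` for percent-encoding are all illustrative.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Assumed value: the commit only says URLs derive from a VIRTUAL_PATH_PREFIX constant.
VIRTUAL_PATH_PREFIX = "/mnt/user-data"


def upload_virtual_path(filename: str) -> str:
    """Path of an uploaded file as seen inside the sandbox."""
    return f"{VIRTUAL_PATH_PREFIX}/uploads/{filename}"


def upload_artifact_url(thread_id: str, filename: str) -> str:
    """Artifact URL for an upload, with the filename percent-encoded."""
    # safe="" also encodes "/", so spaces, "#" and "?" cannot alter the URL structure.
    encoded = quote(filename, safe="")
    return f"/api/threads/{thread_id}/artifacts{VIRTUAL_PATH_PREFIX}/uploads/{encoded}"


def claim_unique_filename(name: str, seen: set[str]) -> str:
    """Return a name not yet in `seen` and record it there (mutates `seen`)."""
    path = PurePosixPath(name)
    candidate = name
    counter = 1
    # Assumed collision scheme: report.pdf, report_1.pdf, report_2.pdf, ...
    while candidate in seen:
        candidate = f"{path.stem}_{counter}{path.suffix}"
        counter += 1
    seen.add(candidate)  # the caller never has to call .add() itself
    return candidate
```

Because the set is updated inside the function, a caller that processes a batch only passes the same set on every call, which is the pattern upload_files follows with seen_names.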
2026-04-25 23:14:46 +08:00 · 2026-03-25 16:28:33 +08:00
parent ec46ae075d
commit b8bc80d89b
14 changed files with 2591 additions and 567 deletions
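The input-validation changes described in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in deerflow.uploads.manager: the regular expression is derived from the allowed character set listed in the message, and `resolve_inside` is a hypothetical helper name used here only to show why PathTraversalError subclasses ValueError.

```python
import re
from pathlib import Path

# Allowed characters per the commit message: alphanumerics, hyphens, underscores, dots.
_SAFE_THREAD_ID = re.compile(r"[A-Za-z0-9._-]+")


class PathTraversalError(ValueError):
    """Malformed path input; being a ValueError lets callers map it to HTTP 400."""


def validate_thread_id(thread_id: str) -> None:
    """Raise ValueError if thread_id contains characters outside the allowlist."""
    if not _SAFE_THREAD_ID.fullmatch(thread_id):
        raise ValueError(f"Invalid thread_id: {thread_id!r}")


def resolve_inside(base_dir: Path, filename: str) -> Path:
    """Hypothetical helper: resolve filename under base_dir or raise."""
    base = base_dir.resolve()
    target = (base / filename).resolve()
    # Path.is_relative_to requires Python 3.9+.
    if not target.is_relative_to(base):
        raise PathTraversalError(f"Path traversal detected: {filename!r}")
    return target
```

A character allowlist on its own still admits values such as `.` and `..`, so the sketch keeps the resolve-based containment check as a separate step. A handler that catches ValueError covers both failure modes with a single 400 response.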
--- a/backend/packages/harness/deerflow/client.py
+++ b/backend/packages/harness/deerflow/client.py
@@ -19,12 +19,9 @@ import asyncio
 import json
 import logging
 import mimetypes
-import os
-import re
 import shutil
 import tempfile
 import uuid
-import zipfile
 from collections.abc import Generator
 from dataclasses import dataclass, field
 from pathlib import Path
@@ -42,6 +39,17 @@ from deerflow.config.app_config import get_app_config, reload_app_config
 from deerflow.config.extensions_config import ExtensionsConfig, SkillStateConfig, get_extensions_config, reload_extensions_config
 from deerflow.config.paths import get_paths
 from deerflow.models import create_chat_model
+from deerflow.skills.installer import install_skill_from_archive
+from deerflow.uploads.manager import (
+    claim_unique_filename,
+    delete_file_safe,
+    enrich_file_listing,
+    ensure_uploads_dir,
+    get_uploads_dir,
+    list_files_in_dir,
+    upload_artifact_url,
+    upload_virtual_path,
+)

 logger = logging.getLogger(__name__)

@@ -566,6 +574,7 @@ class DeerFlowClient:
        self._atomic_write_json(config_path, config_data)

        self._agent = None
+        self._agent_config_key = None
        reloaded = reload_extensions_config()
        return {"mcp_servers": {name: server.model_dump() for name, server in reloaded.mcp_servers.items()}}

@@ -631,6 +640,7 @@ class DeerFlowClient:
        self._atomic_write_json(config_path, config_data)

        self._agent = None
+        self._agent_config_key = None
        reload_extensions_config()

        updated = next((s for s in load_skills(enabled_only=False) if s.name == name), None)
@@ -657,56 +667,7 @@ class DeerFlowClient:
            FileNotFoundError: If the file does not exist.
            ValueError: If the file is invalid.
        """
-        from deerflow.skills.loader import get_skills_root_path
-        from deerflow.skills.validation import _validate_skill_frontmatter
-
-        path = Path(skill_path)
-        if not path.exists():
-            raise FileNotFoundError(f"Skill file not found: {skill_path}")
-        if not path.is_file():
-            raise ValueError(f"Path is not a file: {skill_path}")
-        if path.suffix != ".skill":
-            raise ValueError("File must have .skill extension")
-        if not zipfile.is_zipfile(path):
-            raise ValueError("File is not a valid ZIP archive")
-
-        skills_root = get_skills_root_path()
-        custom_dir = skills_root / "custom"
-        custom_dir.mkdir(parents=True, exist_ok=True)
-
-        with tempfile.TemporaryDirectory() as tmp:
-            tmp_path = Path(tmp)
-            with zipfile.ZipFile(path, "r") as zf:
-                total_size = sum(info.file_size for info in zf.infolist())
-                if total_size > 100 * 1024 * 1024:
-                    raise ValueError("Skill archive too large when extracted (>100MB)")
-                for info in zf.infolist():
-                    if Path(info.filename).is_absolute() or ".." in Path(info.filename).parts:
-                        raise ValueError(f"Unsafe path in archive: {info.filename}")
-                zf.extractall(tmp_path)
-            for p in tmp_path.rglob("*"):
-                if p.is_symlink():
-                    p.unlink()
-
-            items = list(tmp_path.iterdir())
-            if not items:
-                raise ValueError("Skill archive is empty")
-
-            skill_dir = items[0] if len(items) == 1 and items[0].is_dir() else tmp_path
-
-            is_valid, message, skill_name = _validate_skill_frontmatter(skill_dir)
-            if not is_valid:
-                raise ValueError(f"Invalid skill: {message}")
-            if not re.fullmatch(r"[a-zA-Z0-9_-]+", skill_name):
-                raise ValueError(f"Invalid skill name: {skill_name}")
-
-            target = custom_dir / skill_name
-            if target.exists():
-                raise ValueError(f"Skill '{skill_name}' already exists")
-
-            shutil.copytree(skill_dir, target)
-
-        return {"success": True, "skill_name": skill_name, "message": f"Skill '{skill_name}' installed successfully"}
+        return install_skill_from_archive(skill_path)

    # ------------------------------------------------------------------
    # Public API — memory management
@@ -756,13 +717,6 @@ class DeerFlowClient:
    # Public API — file uploads
    # ------------------------------------------------------------------

-    @staticmethod
-    def _get_uploads_dir(thread_id: str) -> Path:
-        """Get (and create) the uploads directory for a thread."""
-        base = get_paths().sandbox_uploads_dir(thread_id)
-        base.mkdir(parents=True, exist_ok=True)
-        return base
-
    def upload_files(self, thread_id: str, files: list[str | Path]) -> dict:
        """Upload local files into a thread's uploads directory.

@@ -784,7 +738,7 @@ class DeerFlowClient:

        # Validate all files upfront to avoid partial uploads.
        resolved_files = []
-        convertible_extensions = {ext.lower() for ext in CONVERTIBLE_EXTENSIONS}
+        seen_names: set[str] = set()
        has_convertible_file = False
        for f in files:
            p = Path(f)
@@ -792,11 +746,12 @@ class DeerFlowClient:
                raise FileNotFoundError(f"File not found: {f}")
            if not p.is_file():
                raise ValueError(f"Path is not a file: {f}")
-            resolved_files.append(p)
-            if not has_convertible_file and p.suffix.lower() in convertible_extensions:
+            dest_name = claim_unique_filename(p.name, seen_names)
+            resolved_files.append((p, dest_name))
+            if not has_convertible_file and p.suffix.lower() in CONVERTIBLE_EXTENSIONS:
                has_convertible_file = True

-        uploads_dir = self._get_uploads_dir(thread_id)
+        uploads_dir = ensure_uploads_dir(thread_id)
        uploaded_files: list[dict] = []

        conversion_pool = None
@@ -816,19 +771,21 @@ class DeerFlowClient:
            return asyncio.run(convert_file_to_markdown(path))

        try:
-            for src_path in resolved_files:
-                dest = uploads_dir / src_path.name
+            for src_path, dest_name in resolved_files:
+                dest = uploads_dir / dest_name
                shutil.copy2(src_path, dest)

                info: dict[str, Any] = {
-                    "filename": src_path.name,
+                    "filename": dest_name,
                    "size": str(dest.stat().st_size),
                    "path": str(dest),
-                    "virtual_path": f"/mnt/user-data/uploads/{src_path.name}",
-                    "artifact_url": f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{src_path.name}",
+                    "virtual_path": upload_virtual_path(dest_name),
+                    "artifact_url": upload_artifact_url(thread_id, dest_name),
                }
+                if dest_name != src_path.name:
+                    info["original_filename"] = src_path.name

-                if src_path.suffix.lower() in convertible_extensions:
+                if src_path.suffix.lower() in CONVERTIBLE_EXTENSIONS:
                    try:
                        if conversion_pool is not None:
                            md_path = conversion_pool.submit(_convert_in_thread, dest).result()
@@ -844,8 +801,9 @@ class DeerFlowClient:

                    if md_path is not None:
                        info["markdown_file"] = md_path.name
-                        info["markdown_virtual_path"] = f"/mnt/user-data/uploads/{md_path.name}"
-                        info["markdown_artifact_url"] = f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{md_path.name}"
+                        info["markdown_path"] = str(uploads_dir / md_path.name)
+                        info["markdown_virtual_path"] = upload_virtual_path(md_path.name)
+                        info["markdown_artifact_url"] = upload_artifact_url(thread_id, md_path.name)

                uploaded_files.append(info)
        finally:
@@ -868,29 +826,9 @@ class DeerFlowClient:
            Dict with "files" and "count" keys, matching the Gateway API
            ``list_uploaded_files`` response.
        """
-        uploads_dir = self._get_uploads_dir(thread_id)
-        if not uploads_dir.exists():
-            return {"files": [], "count": 0}
-
-        files = []
-        with os.scandir(uploads_dir) as entries:
-            file_entries = [entry for entry in entries if entry.is_file()]
-
-        for entry in sorted(file_entries, key=lambda item: item.name):
-            stat = entry.stat()
-            filename = entry.name
-            files.append(
-                {
-                    "filename": filename,
-                    "size": str(stat.st_size),
-                    "path": str(Path(entry.path)),
-                    "virtual_path": f"/mnt/user-data/uploads/{filename}",
-                    "artifact_url": f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{filename}",
-                    "extension": Path(filename).suffix,
-                    "modified": stat.st_mtime,
-                }
-            )
-        return {"files": files, "count": len(files)}
+        uploads_dir = get_uploads_dir(thread_id)
+        result = list_files_in_dir(uploads_dir)
+        return enrich_file_listing(result, thread_id)

    def delete_upload(self, thread_id: str, filename: str) -> dict:
        """Delete a file from a thread's uploads directory.
@@ -907,19 +845,10 @@ class DeerFlowClient:
            FileNotFoundError: If the file does not exist.
            PermissionError: If path traversal is detected.
        """
-        uploads_dir = self._get_uploads_dir(thread_id)
-        file_path = (uploads_dir / filename).resolve()
+        from deerflow.utils.file_conversion import CONVERTIBLE_EXTENSIONS

-        try:
-            file_path.relative_to(uploads_dir.resolve())
-        except ValueError as exc:
-            raise PermissionError("Access denied: path traversal detected") from exc
-
-        if not file_path.is_file():
-            raise FileNotFoundError(f"File not found: {filename}")
-
-        file_path.unlink()
-        return {"success": True, "message": f"Deleted {filename}"}
+        uploads_dir = get_uploads_dir(thread_id)
+        return delete_file_safe(uploads_dir, filename, convertible_extensions=CONVERTIBLE_EXTENSIONS)

    # ------------------------------------------------------------------
    # Public API — artifacts
@@ -939,19 +868,13 @@ class DeerFlowClient:
            FileNotFoundError: If the artifact does not exist.
            ValueError: If the path is invalid.
        """
-        virtual_prefix = "mnt/user-data"
-        clean_path = path.lstrip("/")
-        if not clean_path.startswith(virtual_prefix):
-            raise ValueError(f"Path must start with /{virtual_prefix}")
-
-        relative = clean_path[len(virtual_prefix) :].lstrip("/")
-        base_dir = get_paths().sandbox_user_data_dir(thread_id)
-        actual = (base_dir / relative).resolve()
-
        try:
-            actual.relative_to(base_dir.resolve())
+            actual = get_paths().resolve_virtual_path(thread_id, path)
        except ValueError as exc:
-            raise PermissionError("Access denied: path traversal detected") from exc
+            if "traversal" in str(exc):
+                from deerflow.uploads.manager import PathTraversalError
+                raise PathTraversalError("Path traversal detected") from exc
+            raise
        if not actual.exists():
            raise FileNotFoundError(f"Artifact not found: {path}")
        if not actual.is_file():