deer-flow/backend/packages/harness/deerflow/skills/installer.py

184 lines
6.0 KiB
Python

refactor: extract shared skill installer and upload manager to harness (#1202) * refactor: extract shared skill installer and upload manager to harness Move duplicated business logic from Gateway routers and Client into shared harness modules, eliminating code duplication. New shared modules: - deerflow.skills.installer: 6 functions (zip security, extraction, install) - deerflow.uploads.manager: 7 functions (normalize, deduplicate, validate, list, delete, get_uploads_dir, ensure_uploads_dir) Key improvements: - SkillAlreadyExistsError replaces stringly-typed 409 status routing - normalize_filename rejects backslash-containing filenames - Read paths (list/delete) no longer mkdir via get_uploads_dir - Write paths use ensure_uploads_dir for explicit directory creation - list_files_in_dir does stat inside scandir context (no re-stat) - install_skill_from_archive uses single is_file() check (one syscall) - Fix agent config key not reset on update_mcp_config/update_skill Tests: 42 new (22 installer + 20 upload manager) + client hardening * refactor: centralize upload URL construction and clean up installer - Extract upload_virtual_path(), upload_artifact_url(), enrich_file_listing() into shared manager.py, eliminating 6 duplicated URL constructions across Gateway router and Client - Derive all upload URLs from VIRTUAL_PATH_PREFIX constant instead of hardcoded "mnt/user-data/uploads" strings - Eliminate TOCTOU pre-checks and double file read in installer — single ZipFile() open with exception handling replaces is_file() + is_zipfile() + ZipFile() sequence - Add missing re-exports: ensure_uploads_dir in uploads/__init__.py, SkillAlreadyExistsError in skills/__init__.py - Remove redundant .lower() on already-lowercase CONVERTIBLE_EXTENSIONS - Hoist sandbox_uploads_dir(thread_id) before loop in uploads router * fix: add input validation for thread_id and filename length - Reject thread_id containing unsafe filesystem characters (only allow alphanumeric, hyphens, underscores, 
dots) — prevents 500 on inputs like <script> or shell metacharacters - Reject filenames longer than 255 bytes (OS limit) in normalize_filename - Gateway upload router maps ValueError to 400 for invalid thread_id * fix: address PR review — symlink safety, input validation coverage, error ordering - list_files_in_dir: use follow_symlinks=False to prevent symlink metadata leakage; check is_dir() instead of exists() for non-directory paths - install_skill_from_archive: restore is_file() pre-check before extension validation so error messages match the documented exception contract - validate_thread_id: move from ensure_uploads_dir to get_uploads_dir so all entry points (upload/list/delete) are protected - delete_uploaded_file: catch ValueError from thread_id validation (was 500) - requires_llm marker: also skip when OPENAI_API_KEY is unset - e2e fixture: update TitleMiddleware exclusion comment (kept filtering — middleware triggers extra LLM calls that add non-determinism to tests) * chore: revert uv.lock to main — no dependency changes in this PR * fix: use monkeypatch for global config in e2e fixture to prevent test pollution The e2e_env fixture was calling set_title_config() and set_summarization_config() directly, which mutated global singletons without automatic cleanup. When pytest ran test_client_e2e.py before test_title_middleware_core_logic.py, the leaked enabled=False caused 5 title tests to fail in CI. Switched to monkeypatch.setattr on the module-level private variables so pytest restores the originals after each test. * fix: address code review — URL encoding, API consistency, test isolation - upload_artifact_url: percent-encode filename to handle spaces/#/? 
- deduplicate_filename: mutate seen set in place (caller no longer needs manual .add() — less error-prone API) - list_files_in_dir: document that size is int, enrich stringifies - e2e fixture: monkeypatch _app_config instead of set_app_config() to prevent global singleton pollution (same pattern as title/summarization fix) - _make_e2e_config: read LLM connection details from env vars so external contributors can override defaults - Update tests to match new deduplicate_filename contract * docs: rewrite RFC in English and add alternatives/breaking changes sections * fix: address code review feedback on PR #1202 - Rename deduplicate_filename to claim_unique_filename to make the in-place set mutation explicit in the function name - Replace PermissionError with PathTraversalError(ValueError) for path traversal detection — malformed input is 400, not 403 * fix: set _app_config_is_custom in e2e test fixture to prevent config.yaml lookup in CI --------- Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
2026-03-25 16:28:33 +08:00
"""Shared skill archive installation logic.

Pure business logic with no FastAPI/HTTP dependencies.
Both Gateway and Client delegate to these functions.
"""

import logging
import posixpath
import shutil
import stat
import tempfile
import zipfile
from pathlib import Path, PurePosixPath, PureWindowsPath

from deerflow.skills.loader import get_skills_root_path
from deerflow.skills.validation import _validate_skill_frontmatter

logger = logging.getLogger(__name__)


class SkillAlreadyExistsError(ValueError):
    """Raised when a skill with the same name is already installed."""


def is_unsafe_zip_member(info: zipfile.ZipInfo) -> bool:
    """Return True if the zip member path is absolute or attempts directory traversal."""
    name = info.filename
    if not name:
        return False
    normalized = name.replace("\\", "/")
    if normalized.startswith("/"):
        return True
    path = PurePosixPath(normalized)
    if path.is_absolute():
        return True
    if PureWindowsPath(name).is_absolute():
        return True
    if ".." in path.parts:
        return True
    return False
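
The checks in `is_unsafe_zip_member` can be tried on plain strings. The sketch below restates them in a hypothetical helper, `looks_unsafe`, which is not part of this module and takes a name rather than a `ZipInfo`. It assumes the same order of checks as above.

```python
from pathlib import PurePosixPath, PureWindowsPath


def looks_unsafe(name: str) -> bool:
    """Hypothetical string-based restatement of the member-path checks."""
    if not name:
        return False
    # Treat backslashes as separators so Windows-style names are inspected too.
    normalized = name.replace("\\", "/")
    if normalized.startswith("/"):
        return True
    # Catches drive-letter paths such as C:\Windows\x.dll on any host OS.
    if PureWindowsPath(name).is_absolute():
        return True
    # Only a whole ".." path component counts, not dots inside a file name.
    return ".." in PurePosixPath(normalized).parts


samples = [
    "skill/notes.txt",
    "../etc/passwd",
    "/abs/path",
    "C:\\Windows\\x.dll",
    "a\\..\\b",
    "notes..txt",
]
results = {s: looks_unsafe(s) for s in samples}
```

Only the first and last sample names pass; `notes..txt` is accepted because `..` appears inside a component rather than as a component of its own.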


def is_symlink_member(info: zipfile.ZipInfo) -> bool:
    """Detect symlinks based on the external attributes stored in the ZipInfo."""
    mode = info.external_attr >> 16
    return stat.S_ISLNK(mode)
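
To illustrate the attribute layout that `is_symlink_member` relies on, the following sketch builds a small in-memory archive with one symlink-typed entry and one regular entry, then reads the mode back from the upper 16 bits of `external_attr`. The entry names and contents are made up for the example.

```python
import io
import stat
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # A symlink entry: Unix file type and permissions live in the high 16 bits.
    link = zipfile.ZipInfo("skill/link")
    link.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(link, "/etc/passwd")  # the payload of a symlink entry is its target

    # A regular file entry for comparison.
    regular = zipfile.ZipInfo("skill/notes.txt")
    regular.external_attr = (stat.S_IFREG | 0o644) << 16
    zf.writestr(regular, "demo")

with zipfile.ZipFile(buf) as zf:
    flags = {
        info.filename: stat.S_ISLNK(info.external_attr >> 16)
        for info in zf.infolist()
    }
```

Here `flags` marks `skill/link` as a symlink and `skill/notes.txt` as a regular entry.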


def should_ignore_archive_entry(path: Path) -> bool:
    """Return True for macOS metadata dirs and dotfiles."""
    return path.name.startswith(".") or path.name == "__MACOSX"


def resolve_skill_dir_from_archive(temp_path: Path) -> Path:
    """Locate the skill root directory from extracted archive contents.

    Filters out macOS metadata (__MACOSX) and dotfiles (.DS_Store).

    Returns:
        Path to the skill directory.

    Raises:
        ValueError: If the archive is empty after filtering.
    """
    items = [p for p in temp_path.iterdir() if not should_ignore_archive_entry(p)]
    if not items:
        raise ValueError("Skill archive is empty")
    if len(items) == 1 and items[0].is_dir():
        return items[0]
    return temp_path
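
The unwrapping rule in `resolve_skill_dir_from_archive` can be seen on a throwaway directory tree. This sketch uses a hypothetical stand-in, `pick_skill_root`, that inlines the same filter. The directory and file names are illustrative only.

```python
import tempfile
from pathlib import Path


def pick_skill_root(extracted: Path) -> Path:
    """Hypothetical stand-in: unwrap a single top-level directory, else use the root."""
    items = [
        p for p in extracted.iterdir()
        if not (p.name.startswith(".") or p.name == "__MACOSX")
    ]
    if not items:
        raise ValueError("Skill archive is empty")
    if len(items) == 1 and items[0].is_dir():
        return items[0]
    return extracted


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "__MACOSX").mkdir()            # ignored: macOS metadata directory
    (root / ".DS_Store").write_text("")    # ignored: dotfile
    (root / "my-skill").mkdir()
    (root / "my-skill" / "notes.txt").write_text("demo")

    # One real directory remains after filtering, so it becomes the skill root.
    wrapped_name = pick_skill_root(root).name

    # A second top-level entry means the archive root itself is the skill root.
    (root / "README.md").write_text("extra top-level file")
    flat_is_root = pick_skill_root(root) == root
```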


def safe_extract_skill_archive(
    zip_ref: zipfile.ZipFile,
    dest_path: Path,
    max_total_size: int = 512 * 1024 * 1024,
) -> None:
    """Safely extract a skill archive with security protections.

    Protections:
    - Reject absolute paths and directory traversal (..).
    - Skip symlink entries instead of materialising them.
    - Enforce a hard limit on total uncompressed size (zip bomb defence).

    Raises:
        ValueError: If the archive contains an unsafe member or exceeds the size limit.
    """
    dest_root = dest_path.resolve()
    total_written = 0

    for info in zip_ref.infolist():
        if is_unsafe_zip_member(info):
            raise ValueError(f"Archive contains unsafe member path: {info.filename!r}")

        if is_symlink_member(info):
            logger.warning("Skipping symlink entry in skill archive: %s", info.filename)
            continue

        normalized_name = posixpath.normpath(info.filename.replace("\\", "/"))
        member_path = dest_root.joinpath(*PurePosixPath(normalized_name).parts)
        if not member_path.resolve().is_relative_to(dest_root):
            raise ValueError(f"Zip entry escapes destination: {info.filename!r}")
        member_path.parent.mkdir(parents=True, exist_ok=True)

        if info.is_dir():
            member_path.mkdir(parents=True, exist_ok=True)
            continue

        with zip_ref.open(info) as src, member_path.open("wb") as dst:
            while chunk := src.read(65536):
                total_written += len(chunk)
                if total_written > max_total_size:
                    raise ValueError("Skill archive is too large or appears highly compressed.")
                dst.write(chunk)
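
The size cap in `safe_extract_skill_archive` counts bytes as they are decompressed instead of trusting the sizes declared in the archive headers. A minimal sketch of that idea follows, using a hypothetical helper `read_capped` that streams members without writing them to disk. The 1 MiB payload and the two limits are arbitrary values chosen for the demonstration.

```python
import io
import zipfile


def read_capped(zf: zipfile.ZipFile, max_total_size: int) -> int:
    """Stream every member in 64 KiB chunks; fail once the running total exceeds the cap."""
    total = 0
    for info in zf.infolist():
        with zf.open(info) as src:
            while chunk := src.read(65536):
                total += len(chunk)
                if total > max_total_size:
                    raise ValueError("archive too large or highly compressed")
    return total


# Build a tiny archive whose single member inflates to 1 MiB of zero bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("zeros.bin", b"\0" * (1024 * 1024))

with zipfile.ZipFile(buf) as zf:
    # A generous cap lets the whole member through.
    ok_total = read_capped(zf, max_total_size=2 * 1024 * 1024)

    # A tight cap aborts partway through decompression.
    try:
        read_capped(zf, max_total_size=256 * 1024)
        rejected = False
    except ValueError:
        rejected = True
```

Because the check runs inside the read loop, extraction stops shortly after the limit is crossed instead of after the whole member has been inflated.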


def install_skill_from_archive(
    zip_path: str | Path,
    *,
    skills_root: Path | None = None,
) -> dict:
    """Install a skill from a .skill archive (ZIP).

    Args:
        zip_path: Path to the .skill file.
        skills_root: Override the skills root directory. If None, uses
            the default from config.

    Returns:
        Dict with success, skill_name, message.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file is invalid (wrong extension, bad ZIP,
            invalid frontmatter, duplicate name).
    """
    logger.info("Installing skill from %s", zip_path)
    path = Path(zip_path)
    if not path.is_file():
        if not path.exists():
            raise FileNotFoundError(f"Skill file not found: {zip_path}")
        raise ValueError(f"Path is not a file: {zip_path}")
    if path.suffix != ".skill":
        raise ValueError("File must have .skill extension")

    if skills_root is None:
        skills_root = get_skills_root_path()
    custom_dir = skills_root / "custom"
    custom_dir.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)

        try:
            zf = zipfile.ZipFile(path, "r")
        except FileNotFoundError:
            raise FileNotFoundError(f"Skill file not found: {zip_path}") from None
        except (zipfile.BadZipFile, IsADirectoryError):
            raise ValueError("File is not a valid ZIP archive") from None

        with zf:
            safe_extract_skill_archive(zf, tmp_path)

        skill_dir = resolve_skill_dir_from_archive(tmp_path)

        is_valid, message, skill_name = _validate_skill_frontmatter(skill_dir)
        if not is_valid:
            raise ValueError(f"Invalid skill: {message}")
        if not skill_name or "/" in skill_name or "\\" in skill_name or ".." in skill_name:
            raise ValueError(f"Invalid skill name: {skill_name}")

        target = custom_dir / skill_name
        if target.exists():
            raise SkillAlreadyExistsError(f"Skill '{skill_name}' already exists")

        shutil.copytree(skill_dir, target)
        logger.info("Skill %r installed to %s", skill_name, target)

    return {
        "success": True,
        "skill_name": skill_name,
        "message": f"Skill '{skill_name}' installed successfully",
    }
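
Since `SkillAlreadyExistsError` subclasses `ValueError`, a caller that wants to treat duplicates differently has to catch it first. The sketch below shows one way a caller might map the documented exceptions to HTTP-style status codes. The 409 follows the commit message; the 404, 400, and 200 values are assumptions. `status_for` and `fake_install` are hypothetical, and the exception class is a local stand-in so the example runs without the `deerflow` package.

```python
class SkillAlreadyExistsError(ValueError):
    """Local stand-in mirroring the class defined in this module."""


def status_for(install, zip_path: str) -> int:
    """Call an installer and translate its exceptions into a status code."""
    try:
        install(zip_path)
    except FileNotFoundError:
        return 404  # assumed mapping
    except SkillAlreadyExistsError:
        # Must come before ValueError, which would otherwise catch this subclass.
        return 409
    except ValueError:
        return 400  # assumed mapping
    return 200


def fake_install(zip_path: str) -> dict:
    """Hypothetical stand-in for install_skill_from_archive, keyed on the file name."""
    if zip_path == "missing.skill":
        raise FileNotFoundError(zip_path)
    if zip_path == "dup.skill":
        raise SkillAlreadyExistsError("Skill 'dup' already exists")
    if not zip_path.endswith(".skill"):
        raise ValueError("File must have .skill extension")
    return {"success": True}


codes = [
    status_for(fake_install, p)
    for p in ("ok.skill", "missing.skill", "dup.skill", "notes.zip")
]
```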