refactor: extract shared skill installer and upload manager to harness (#1202)

* refactor: extract shared skill installer and upload manager to harness

Move duplicated business logic from Gateway routers and Client into
shared harness modules, eliminating code duplication.

New shared modules:
- deerflow.skills.installer: 6 functions (zip security, extraction, install)
- deerflow.uploads.manager: 7 functions (normalize, deduplicate, validate,
  list, delete, get_uploads_dir, ensure_uploads_dir)

Key improvements:
- SkillAlreadyExistsError replaces stringly-typed 409 status routing
- normalize_filename rejects backslash-containing filenames
- Read paths (list/delete) no longer mkdir via get_uploads_dir
- Write paths use ensure_uploads_dir for explicit directory creation
- list_files_in_dir does stat inside scandir context (no re-stat)
- install_skill_from_archive uses single is_file() check (one syscall)
- Fix agent config key not reset on update_mcp_config/update_skill

Tests: 42 new (22 installer + 20 upload manager) + client hardening
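
A minimal sketch of why a dedicated exception type is easier to route than matching on message text. Only `SkillAlreadyExistsError(ValueError)` comes from this commit; the `status_for` helper and the 404/400/500 choices are hypothetical and only illustrate the idea.

```python
class SkillAlreadyExistsError(ValueError):
    """Raised when a skill with the same name is already installed."""


def status_for(exc: Exception) -> int:
    """Hypothetical helper: choose an HTTP status from the exception type."""
    # The subclass must be checked before ValueError, because it is one.
    if isinstance(exc, SkillAlreadyExistsError):
        return 409
    if isinstance(exc, FileNotFoundError):
        return 404
    if isinstance(exc, ValueError):
        return 400
    return 500
```

Because the subclass still is a `ValueError`, callers that only catch `ValueError` keep working unchanged.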

* refactor: centralize upload URL construction and clean up installer

- Extract upload_virtual_path(), upload_artifact_url(), enrich_file_listing()
  into shared manager.py, eliminating 6 duplicated URL constructions across
  Gateway router and Client
- Derive all upload URLs from VIRTUAL_PATH_PREFIX constant instead of
  hardcoded "mnt/user-data/uploads" strings
- Eliminate TOCTOU pre-checks and double file read in installer — single
  ZipFile() open with exception handling replaces is_file() + is_zipfile()
  + ZipFile() sequence
- Add missing re-exports: ensure_uploads_dir in uploads/__init__.py,
  SkillAlreadyExistsError in skills/__init__.py
- Remove redundant .lower() on already-lowercase CONVERTIBLE_EXTENSIONS
- Hoist sandbox_uploads_dir(thread_id) before loop in uploads router
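
The TOCTOU point above can be sketched as follows: rather than checking the path and then opening it, open it once and translate the errors. This is an illustration under assumptions, not the project's code; `open_skill_archive` and `classify` are hypothetical names, and on some platforms opening a directory may raise a different `OSError` subclass than `IsADirectoryError`.

```python
import tempfile
import zipfile
from pathlib import Path


def open_skill_archive(path: Path) -> zipfile.ZipFile:
    """Open *path* as a ZIP in a single step, translating low-level errors."""
    try:
        return zipfile.ZipFile(path, "r")
    except FileNotFoundError:
        raise FileNotFoundError(f"Skill file not found: {path}") from None
    except (zipfile.BadZipFile, IsADirectoryError):
        raise ValueError("File is not a valid ZIP archive") from None


def classify(path: Path) -> str:
    """Hypothetical demo helper: report how opening *path* went."""
    try:
        with open_skill_archive(path) as zf:
            return f"ok:{len(zf.namelist())}"
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "invalid"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    good = root / "demo.skill"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("demo/README.md", "# demo\n")
    bad = root / "broken.skill"
    bad.write_bytes(b"not a zip")
    results = [classify(good), classify(bad), classify(root / "absent.skill")]
```

There is no window between a check and the open in which the file can change, and the file is read only once.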

* fix: add input validation for thread_id and filename length

- Reject thread_id containing unsafe filesystem characters (only allow
  alphanumeric, hyphens, underscores, dots) — prevents 500 on inputs
  like <script> or shell metacharacters
- Reject filenames longer than 255 bytes (OS limit) in normalize_filename
- Gateway upload router maps ValueError to 400 for invalid thread_id
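
A small sketch of the two checks described above. The helper names are hypothetical. Note one deliberate difference from the diff: this sketch uses `re.fullmatch`, which also rejects a trailing newline, whereas a `^...$` pattern with `match` accepts one.

```python
import re

# Same character class as described above: alphanumerics, dot, underscore, hyphen.
_SAFE_ID = re.compile(r"[a-zA-Z0-9._-]+")


def is_safe_thread_id(thread_id: str) -> bool:
    """Return True only if every character is in the allowed set."""
    return bool(thread_id) and _SAFE_ID.fullmatch(thread_id) is not None


def fits_name_limit(filename: str, limit: int = 255) -> bool:
    """Measure the UTF-8 encoded length, since the limit is in bytes, not characters."""
    return len(filename.encode("utf-8")) <= limit
```

The byte-based check matters for non-ASCII names: 200 copies of `é` are 200 characters but 400 bytes in UTF-8.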

* fix: address PR review — symlink safety, input validation coverage, error ordering

- list_files_in_dir: use follow_symlinks=False to prevent symlink metadata
  leakage; check is_dir() instead of exists() for non-directory paths
- install_skill_from_archive: restore is_file() pre-check before extension
  validation so error messages match the documented exception contract
- validate_thread_id: move from ensure_uploads_dir to get_uploads_dir so
  all entry points (upload/list/delete) are protected
- delete_uploaded_file: catch ValueError from thread_id validation (was 500)
- requires_llm marker: also skip when OPENAI_API_KEY is unset
- e2e fixture: update TitleMiddleware exclusion comment (kept filtering —
  middleware triggers extra LLM calls that add non-determinism to tests)
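
The symlink handling in `list_files_in_dir` can be illustrated with a reduced sketch. `list_regular_files` is a hypothetical, trimmed-down stand-in that keeps only the name and size; the demo tolerates platforms where symlinks cannot be created.

```python
import os
import tempfile
from pathlib import Path


def list_regular_files(directory: Path) -> list[dict]:
    """List regular files only; symlinks and subdirectories are skipped."""
    if not directory.is_dir():
        return []
    files = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            # follow_symlinks=False: a symlink to a file is not reported as a file.
            if not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)
            files.append({"filename": entry.name, "size": st.st_size})
    return files


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"hello")
    (root / "subdir").mkdir()
    try:
        (root / "link.txt").symlink_to(root / "a.txt")
    except (OSError, NotImplementedError):
        pass  # symlinks may be unavailable on this platform
    listing = list_regular_files(root)
    missing = list_regular_files(root / "nope")
```

Checking `is_dir()` up front also covers the case where the path exists but is a regular file, which `exists()` alone would let through to `os.scandir`.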

* chore: revert uv.lock to main — no dependency changes in this PR

* fix: use monkeypatch for global config in e2e fixture to prevent test pollution

The e2e_env fixture was calling set_title_config() and
set_summarization_config() directly, which mutated global singletons
without automatic cleanup. When pytest ran test_client_e2e.py before
test_title_middleware_core_logic.py, the leaked enabled=False caused
5 title tests to fail in CI.

Switched to monkeypatch.setattr on the module-level private variables
so pytest restores the originals after each test.
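
The restore-on-exit behaviour can be sketched without pytest. The example below swaps in the standard library's `unittest.mock.patch.object` as a stand-in for `monkeypatch.setattr`, and `title_config` is a hypothetical placeholder for a config module holding a private module-level singleton.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a module with a private, module-level config singleton.
title_config = SimpleNamespace(_title_config={"enabled": True})

# Patch the attribute only for the duration of the block.
with mock.patch.object(title_config, "_title_config", {"enabled": False}):
    inside = title_config._title_config["enabled"]

# The original object is back once the block exits, so later tests see it unchanged.
after = title_config._title_config["enabled"]
```

A direct assignment such as `title_config._title_config = {...}` has no such cleanup, which is how a disabled flag can leak into tests that run later.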

* fix: address code review — URL encoding, API consistency, test isolation

- upload_artifact_url: percent-encode filename to handle spaces/#/?
- deduplicate_filename: mutate seen set in place (caller no longer
  needs manual .add() — less error-prone API)
- list_files_in_dir: document that size is an int; enrich_file_listing stringifies it
- e2e fixture: monkeypatch _app_config instead of set_app_config()
  to prevent global singleton pollution (same pattern as title/summarization fix)
- _make_e2e_config: read LLM connection details from env vars so
  external contributors can override defaults
- Update tests to match new deduplicate_filename contract
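
A short sketch of the percent-encoding change. The `/mnt/user-data` value for the prefix is inferred from the hardcoded strings this commit removes, and `artifact_url` is a hypothetical stand-in for `upload_artifact_url`.

```python
from urllib.parse import quote, unquote

# Assumed value of VIRTUAL_PATH_PREFIX, inferred from the removed hardcoded strings.
PREFIX = "/mnt/user-data"


def artifact_url(thread_id: str, filename: str) -> str:
    """Build an artifact URL with the filename fully percent-encoded."""
    # safe="" means even "/" is encoded, so a filename cannot add path segments.
    return f"/api/threads/{thread_id}/artifacts{PREFIX}/uploads/{quote(filename, safe='')}"


url = artifact_url("t1", "my report #1?.pdf")
encoded_name = url.rsplit("/", 1)[1]
```

Without encoding, `#` would start a URL fragment and `?` a query string, so the server would see a truncated filename.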

* docs: rewrite RFC in English and add alternatives/breaking changes sections

* fix: address code review feedback on PR #1202

- Rename deduplicate_filename to claim_unique_filename to make
  the in-place set mutation explicit in the function name
- Replace PermissionError with PathTraversalError(ValueError) for
  path traversal detection — malformed input is 400, not 403
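
To illustrate the renamed helper's contract, here is a self-contained sketch that mirrors the `claim_unique_filename` logic added in this commit and runs it on a few colliding names. The sample filenames are made up for the demonstration.

```python
from pathlib import Path


def claim_unique_filename(name: str, seen: set[str]) -> str:
    """Return a name not in *seen*, appending _N on collision, and record it in *seen*."""
    if name not in seen:
        seen.add(name)
        return name
    stem, suffix = Path(name).stem, Path(name).suffix
    counter = 1
    candidate = f"{stem}_{counter}{suffix}"
    while candidate in seen:
        counter += 1
        candidate = f"{stem}_{counter}{suffix}"
    seen.add(candidate)
    return candidate


seen: set[str] = set()
names = ["a.txt", "a.txt", "a.txt", "b.tar.gz", "b.tar.gz"]
claimed = [claim_unique_filename(n, seen) for n in names]
```

The caller never touches `seen` directly. Since only the last suffix is split off, a multi-part extension such as `.tar.gz` gets the counter before `.gz`.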

* fix: set _app_config_is_custom in e2e test fixture to prevent config.yaml lookup in CI

---------

Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
greatmengqi
2026-03-25 16:28:33 +08:00
committed by GitHub
parent ec46ae075d
commit b8bc80d89b
14 changed files with 2591 additions and 567 deletions


@@ -19,12 +19,9 @@ import asyncio
import json
import logging
import mimetypes
import os
import re
import shutil
import tempfile
import uuid
import zipfile
from collections.abc import Generator
from dataclasses import dataclass, field
from pathlib import Path
@@ -42,6 +39,17 @@ from deerflow.config.app_config import get_app_config, reload_app_config
from deerflow.config.extensions_config import ExtensionsConfig, SkillStateConfig, get_extensions_config, reload_extensions_config
from deerflow.config.paths import get_paths
from deerflow.models import create_chat_model
from deerflow.skills.installer import install_skill_from_archive
from deerflow.uploads.manager import (
claim_unique_filename,
delete_file_safe,
enrich_file_listing,
ensure_uploads_dir,
get_uploads_dir,
list_files_in_dir,
upload_artifact_url,
upload_virtual_path,
)
logger = logging.getLogger(__name__)
@@ -566,6 +574,7 @@ class DeerFlowClient:
self._atomic_write_json(config_path, config_data)
self._agent = None
self._agent_config_key = None
reloaded = reload_extensions_config()
return {"mcp_servers": {name: server.model_dump() for name, server in reloaded.mcp_servers.items()}}
@@ -631,6 +640,7 @@ class DeerFlowClient:
self._atomic_write_json(config_path, config_data)
self._agent = None
self._agent_config_key = None
reload_extensions_config()
updated = next((s for s in load_skills(enabled_only=False) if s.name == name), None)
@@ -657,56 +667,7 @@ class DeerFlowClient:
FileNotFoundError: If the file does not exist.
ValueError: If the file is invalid.
"""
from deerflow.skills.loader import get_skills_root_path
from deerflow.skills.validation import _validate_skill_frontmatter
path = Path(skill_path)
if not path.exists():
raise FileNotFoundError(f"Skill file not found: {skill_path}")
if not path.is_file():
raise ValueError(f"Path is not a file: {skill_path}")
if path.suffix != ".skill":
raise ValueError("File must have .skill extension")
if not zipfile.is_zipfile(path):
raise ValueError("File is not a valid ZIP archive")
skills_root = get_skills_root_path()
custom_dir = skills_root / "custom"
custom_dir.mkdir(parents=True, exist_ok=True)
with tempfile.TemporaryDirectory() as tmp:
tmp_path = Path(tmp)
with zipfile.ZipFile(path, "r") as zf:
total_size = sum(info.file_size for info in zf.infolist())
if total_size > 100 * 1024 * 1024:
raise ValueError("Skill archive too large when extracted (>100MB)")
for info in zf.infolist():
if Path(info.filename).is_absolute() or ".." in Path(info.filename).parts:
raise ValueError(f"Unsafe path in archive: {info.filename}")
zf.extractall(tmp_path)
for p in tmp_path.rglob("*"):
if p.is_symlink():
p.unlink()
items = list(tmp_path.iterdir())
if not items:
raise ValueError("Skill archive is empty")
skill_dir = items[0] if len(items) == 1 and items[0].is_dir() else tmp_path
is_valid, message, skill_name = _validate_skill_frontmatter(skill_dir)
if not is_valid:
raise ValueError(f"Invalid skill: {message}")
if not re.fullmatch(r"[a-zA-Z0-9_-]+", skill_name):
raise ValueError(f"Invalid skill name: {skill_name}")
target = custom_dir / skill_name
if target.exists():
raise ValueError(f"Skill '{skill_name}' already exists")
shutil.copytree(skill_dir, target)
return {"success": True, "skill_name": skill_name, "message": f"Skill '{skill_name}' installed successfully"}
return install_skill_from_archive(skill_path)
# ------------------------------------------------------------------
# Public API — memory management
@@ -756,13 +717,6 @@ class DeerFlowClient:
# Public API — file uploads
# ------------------------------------------------------------------
@staticmethod
def _get_uploads_dir(thread_id: str) -> Path:
"""Get (and create) the uploads directory for a thread."""
base = get_paths().sandbox_uploads_dir(thread_id)
base.mkdir(parents=True, exist_ok=True)
return base
def upload_files(self, thread_id: str, files: list[str | Path]) -> dict:
"""Upload local files into a thread's uploads directory.
@@ -784,7 +738,7 @@ class DeerFlowClient:
# Validate all files upfront to avoid partial uploads.
resolved_files = []
convertible_extensions = {ext.lower() for ext in CONVERTIBLE_EXTENSIONS}
seen_names: set[str] = set()
has_convertible_file = False
for f in files:
p = Path(f)
@@ -792,11 +746,12 @@ class DeerFlowClient:
raise FileNotFoundError(f"File not found: {f}")
if not p.is_file():
raise ValueError(f"Path is not a file: {f}")
resolved_files.append(p)
if not has_convertible_file and p.suffix.lower() in convertible_extensions:
dest_name = claim_unique_filename(p.name, seen_names)
resolved_files.append((p, dest_name))
if not has_convertible_file and p.suffix.lower() in CONVERTIBLE_EXTENSIONS:
has_convertible_file = True
uploads_dir = self._get_uploads_dir(thread_id)
uploads_dir = ensure_uploads_dir(thread_id)
uploaded_files: list[dict] = []
conversion_pool = None
@@ -816,19 +771,21 @@ class DeerFlowClient:
return asyncio.run(convert_file_to_markdown(path))
try:
for src_path in resolved_files:
dest = uploads_dir / src_path.name
for src_path, dest_name in resolved_files:
dest = uploads_dir / dest_name
shutil.copy2(src_path, dest)
info: dict[str, Any] = {
"filename": src_path.name,
"filename": dest_name,
"size": str(dest.stat().st_size),
"path": str(dest),
"virtual_path": f"/mnt/user-data/uploads/{src_path.name}",
"artifact_url": f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{src_path.name}",
"virtual_path": upload_virtual_path(dest_name),
"artifact_url": upload_artifact_url(thread_id, dest_name),
}
if dest_name != src_path.name:
info["original_filename"] = src_path.name
if src_path.suffix.lower() in convertible_extensions:
if src_path.suffix.lower() in CONVERTIBLE_EXTENSIONS:
try:
if conversion_pool is not None:
md_path = conversion_pool.submit(_convert_in_thread, dest).result()
@@ -844,8 +801,9 @@ class DeerFlowClient:
if md_path is not None:
info["markdown_file"] = md_path.name
info["markdown_virtual_path"] = f"/mnt/user-data/uploads/{md_path.name}"
info["markdown_artifact_url"] = f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{md_path.name}"
info["markdown_path"] = str(uploads_dir / md_path.name)
info["markdown_virtual_path"] = upload_virtual_path(md_path.name)
info["markdown_artifact_url"] = upload_artifact_url(thread_id, md_path.name)
uploaded_files.append(info)
finally:
@@ -868,29 +826,9 @@ class DeerFlowClient:
Dict with "files" and "count" keys, matching the Gateway API
``list_uploaded_files`` response.
"""
uploads_dir = self._get_uploads_dir(thread_id)
if not uploads_dir.exists():
return {"files": [], "count": 0}
files = []
with os.scandir(uploads_dir) as entries:
file_entries = [entry for entry in entries if entry.is_file()]
for entry in sorted(file_entries, key=lambda item: item.name):
stat = entry.stat()
filename = entry.name
files.append(
{
"filename": filename,
"size": str(stat.st_size),
"path": str(Path(entry.path)),
"virtual_path": f"/mnt/user-data/uploads/{filename}",
"artifact_url": f"/api/threads/{thread_id}/artifacts/mnt/user-data/uploads/{filename}",
"extension": Path(filename).suffix,
"modified": stat.st_mtime,
}
)
return {"files": files, "count": len(files)}
uploads_dir = get_uploads_dir(thread_id)
result = list_files_in_dir(uploads_dir)
return enrich_file_listing(result, thread_id)
def delete_upload(self, thread_id: str, filename: str) -> dict:
"""Delete a file from a thread's uploads directory.
@@ -907,19 +845,10 @@ class DeerFlowClient:
FileNotFoundError: If the file does not exist.
PathTraversalError: If path traversal is detected.
"""
uploads_dir = self._get_uploads_dir(thread_id)
file_path = (uploads_dir / filename).resolve()
from deerflow.utils.file_conversion import CONVERTIBLE_EXTENSIONS
try:
file_path.relative_to(uploads_dir.resolve())
except ValueError as exc:
raise PermissionError("Access denied: path traversal detected") from exc
if not file_path.is_file():
raise FileNotFoundError(f"File not found: {filename}")
file_path.unlink()
return {"success": True, "message": f"Deleted {filename}"}
uploads_dir = get_uploads_dir(thread_id)
return delete_file_safe(uploads_dir, filename, convertible_extensions=CONVERTIBLE_EXTENSIONS)
# ------------------------------------------------------------------
# Public API — artifacts
@@ -939,19 +868,13 @@ class DeerFlowClient:
FileNotFoundError: If the artifact does not exist.
ValueError: If the path is invalid.
"""
virtual_prefix = "mnt/user-data"
clean_path = path.lstrip("/")
if not clean_path.startswith(virtual_prefix):
raise ValueError(f"Path must start with /{virtual_prefix}")
relative = clean_path[len(virtual_prefix) :].lstrip("/")
base_dir = get_paths().sandbox_user_data_dir(thread_id)
actual = (base_dir / relative).resolve()
try:
actual.relative_to(base_dir.resolve())
actual = get_paths().resolve_virtual_path(thread_id, path)
except ValueError as exc:
raise PermissionError("Access denied: path traversal detected") from exc
if "traversal" in str(exc):
from deerflow.uploads.manager import PathTraversalError
raise PathTraversalError("Path traversal detected") from exc
raise
if not actual.exists():
raise FileNotFoundError(f"Artifact not found: {path}")
if not actual.is_file():


@@ -1,5 +1,14 @@
from .installer import SkillAlreadyExistsError, install_skill_from_archive
from .loader import get_skills_root_path, load_skills
from .types import Skill
from .validation import ALLOWED_FRONTMATTER_PROPERTIES, _validate_skill_frontmatter
__all__ = ["load_skills", "get_skills_root_path", "Skill", "ALLOWED_FRONTMATTER_PROPERTIES", "_validate_skill_frontmatter"]
__all__ = [
    "load_skills",
    "get_skills_root_path",
    "Skill",
    "ALLOWED_FRONTMATTER_PROPERTIES",
    "_validate_skill_frontmatter",
    "install_skill_from_archive",
    "SkillAlreadyExistsError",
]


@@ -0,0 +1,176 @@
"""Shared skill archive installation logic.
Pure business logic — no FastAPI/HTTP dependencies.
Both Gateway and Client delegate to these functions.
"""
import logging
import shutil
import stat
import tempfile
import zipfile
from pathlib import Path

from deerflow.skills.loader import get_skills_root_path
from deerflow.skills.validation import _validate_skill_frontmatter

logger = logging.getLogger(__name__)


class SkillAlreadyExistsError(ValueError):
    """Raised when a skill with the same name is already installed."""


def is_unsafe_zip_member(info: zipfile.ZipInfo) -> bool:
    """Return True if the zip member path is absolute or attempts directory traversal."""
    name = info.filename
    if not name:
        return False
    path = Path(name)
    if path.is_absolute():
        return True
    if ".." in path.parts:
        return True
    return False


def is_symlink_member(info: zipfile.ZipInfo) -> bool:
    """Detect symlinks based on the external attributes stored in the ZipInfo."""
    mode = info.external_attr >> 16
    return stat.S_ISLNK(mode)


def should_ignore_archive_entry(path: Path) -> bool:
    """Return True for macOS metadata dirs and dotfiles."""
    return path.name.startswith(".") or path.name == "__MACOSX"


def resolve_skill_dir_from_archive(temp_path: Path) -> Path:
    """Locate the skill root directory from extracted archive contents.

    Filters out macOS metadata (__MACOSX) and dotfiles (.DS_Store).

    Returns:
        Path to the skill directory.

    Raises:
        ValueError: If the archive is empty after filtering.
    """
    items = [p for p in temp_path.iterdir() if not should_ignore_archive_entry(p)]
    if not items:
        raise ValueError("Skill archive is empty")
    if len(items) == 1 and items[0].is_dir():
        return items[0]
    return temp_path


def safe_extract_skill_archive(
    zip_ref: zipfile.ZipFile,
    dest_path: Path,
    max_total_size: int = 512 * 1024 * 1024,
) -> None:
    """Safely extract a skill archive with security protections.

    Protections:
    - Reject absolute paths and directory traversal (..).
    - Skip symlink entries instead of materialising them.
    - Enforce a hard limit on total uncompressed size (zip bomb defence).

    Raises:
        ValueError: If unsafe members or size limit exceeded.
    """
    dest_root = dest_path.resolve()
    total_written = 0

    for info in zip_ref.infolist():
        if is_unsafe_zip_member(info):
            raise ValueError(f"Archive contains unsafe member path: {info.filename!r}")

        if is_symlink_member(info):
            logger.warning("Skipping symlink entry in skill archive: %s", info.filename)
            continue

        member_path = dest_root / info.filename
        if not member_path.resolve().is_relative_to(dest_root):
            raise ValueError(f"Zip entry escapes destination: {info.filename!r}")
        member_path.parent.mkdir(parents=True, exist_ok=True)

        if info.is_dir():
            member_path.mkdir(parents=True, exist_ok=True)
            continue

        with zip_ref.open(info) as src, member_path.open("wb") as dst:
            while chunk := src.read(65536):
                total_written += len(chunk)
                if total_written > max_total_size:
                    raise ValueError("Skill archive is too large or appears highly compressed.")
                dst.write(chunk)


def install_skill_from_archive(
    zip_path: str | Path,
    *,
    skills_root: Path | None = None,
) -> dict:
    """Install a skill from a .skill archive (ZIP).

    Args:
        zip_path: Path to the .skill file.
        skills_root: Override the skills root directory. If None, uses
            the default from config.

    Returns:
        Dict with success, skill_name, message.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file is invalid (wrong extension, bad ZIP,
            invalid frontmatter, duplicate name).
    """
    logger.info("Installing skill from %s", zip_path)
    path = Path(zip_path)
    if not path.is_file():
        if not path.exists():
            raise FileNotFoundError(f"Skill file not found: {zip_path}")
        raise ValueError(f"Path is not a file: {zip_path}")
    if path.suffix != ".skill":
        raise ValueError("File must have .skill extension")

    if skills_root is None:
        skills_root = get_skills_root_path()
    custom_dir = skills_root / "custom"
    custom_dir.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)

        try:
            zf = zipfile.ZipFile(path, "r")
        except FileNotFoundError:
            raise FileNotFoundError(f"Skill file not found: {zip_path}") from None
        except (zipfile.BadZipFile, IsADirectoryError):
            raise ValueError("File is not a valid ZIP archive") from None

        with zf:
            safe_extract_skill_archive(zf, tmp_path)

        skill_dir = resolve_skill_dir_from_archive(tmp_path)

        is_valid, message, skill_name = _validate_skill_frontmatter(skill_dir)
        if not is_valid:
            raise ValueError(f"Invalid skill: {message}")
        if not skill_name or "/" in skill_name or "\\" in skill_name or ".." in skill_name:
            raise ValueError(f"Invalid skill name: {skill_name}")

        target = custom_dir / skill_name
        if target.exists():
            raise SkillAlreadyExistsError(f"Skill '{skill_name}' already exists")

        shutil.copytree(skill_dir, target)
        logger.info("Skill %r installed to %s", skill_name, target)

    return {
        "success": True,
        "skill_name": skill_name,
        "message": f"Skill '{skill_name}' installed successfully",
    }


@@ -0,0 +1,29 @@
from .manager import (
    PathTraversalError,
    claim_unique_filename,
    delete_file_safe,
    enrich_file_listing,
    ensure_uploads_dir,
    get_uploads_dir,
    list_files_in_dir,
    normalize_filename,
    upload_artifact_url,
    upload_virtual_path,
    validate_path_traversal,
    validate_thread_id,
)

__all__ = [
    "get_uploads_dir",
    "ensure_uploads_dir",
    "normalize_filename",
    "PathTraversalError",
    "claim_unique_filename",
    "validate_path_traversal",
    "list_files_in_dir",
    "delete_file_safe",
    "upload_artifact_url",
    "upload_virtual_path",
    "enrich_file_listing",
    "validate_thread_id",
]


@@ -0,0 +1,198 @@
"""Shared upload management logic.
Pure business logic — no FastAPI/HTTP dependencies.
Both Gateway and Client delegate to these functions.
"""
import os
import re
from pathlib import Path
from urllib.parse import quote

from deerflow.config.paths import VIRTUAL_PATH_PREFIX, get_paths


class PathTraversalError(ValueError):
    """Raised when a path escapes its allowed base directory."""


# thread_id must be alphanumeric, hyphens, underscores, or dots only.
_SAFE_THREAD_ID = re.compile(r"^[a-zA-Z0-9._-]+$")


def validate_thread_id(thread_id: str) -> None:
    """Reject thread IDs containing characters unsafe for filesystem paths.

    Raises:
        ValueError: If thread_id is empty or contains unsafe characters.
    """
    if not thread_id or not _SAFE_THREAD_ID.match(thread_id):
        raise ValueError(f"Invalid thread_id: {thread_id!r}")


def get_uploads_dir(thread_id: str) -> Path:
    """Return the uploads directory path for a thread (no side effects)."""
    validate_thread_id(thread_id)
    return get_paths().sandbox_uploads_dir(thread_id)


def ensure_uploads_dir(thread_id: str) -> Path:
    """Return the uploads directory for a thread, creating it if needed."""
    base = get_uploads_dir(thread_id)
    base.mkdir(parents=True, exist_ok=True)
    return base


def normalize_filename(filename: str) -> str:
    """Sanitize a filename by extracting its basename.

    Strips any directory components and rejects traversal patterns.

    Args:
        filename: Raw filename from user input (may contain path components).

    Returns:
        Safe filename (basename only).

    Raises:
        ValueError: If filename is empty, resolves to a traversal pattern,
            contains a backslash, or exceeds 255 bytes when UTF-8 encoded.
    """
    if not filename:
        raise ValueError("Filename is empty")
    safe = Path(filename).name
    if not safe or safe in {".", ".."}:
        raise ValueError(f"Filename is unsafe: {filename!r}")
    # Reject backslashes: on Linux, Path.name keeps them as literal chars,
    # but they indicate a Windows-style path that should be stripped or rejected.
    if "\\" in safe:
        raise ValueError(f"Filename contains backslash: {filename!r}")
    # The limit is measured in bytes, so report the encoded length in bytes too.
    if len(safe.encode("utf-8")) > 255:
        raise ValueError(f"Filename too long: {len(safe.encode('utf-8'))} bytes")
    return safe


def claim_unique_filename(name: str, seen: set[str]) -> str:
    """Generate a unique filename by appending ``_N`` suffix on collision.

    Automatically adds the returned name to *seen* so callers don't need to.

    Args:
        name: Candidate filename.
        seen: Set of filenames already claimed (mutated in place).

    Returns:
        A filename not present in *seen* (already added to *seen*).
    """
    if name not in seen:
        seen.add(name)
        return name
    stem, suffix = Path(name).stem, Path(name).suffix
    counter = 1
    candidate = f"{stem}_{counter}{suffix}"
    while candidate in seen:
        counter += 1
        candidate = f"{stem}_{counter}{suffix}"
    seen.add(candidate)
    return candidate


def validate_path_traversal(path: Path, base: Path) -> None:
    """Verify that *path* is inside *base*.

    Raises:
        PathTraversalError: If a path traversal is detected.
    """
    try:
        path.resolve().relative_to(base.resolve())
    except ValueError:
        raise PathTraversalError("Path traversal detected") from None


def list_files_in_dir(directory: Path) -> dict:
    """List files (not directories) in *directory*.

    Args:
        directory: Directory to scan.

    Returns:
        Dict with "files" list (sorted by name) and "count".
        Each file entry has ``size`` as *int* (bytes). Call
        :func:`enrich_file_listing` to stringify sizes and add
        virtual / artifact URLs.
    """
    if not directory.is_dir():
        return {"files": [], "count": 0}

    files = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)
            files.append({
                "filename": entry.name,
                "size": st.st_size,
                "path": entry.path,
                "extension": Path(entry.name).suffix,
                "modified": st.st_mtime,
            })
    return {"files": files, "count": len(files)}


def delete_file_safe(base_dir: Path, filename: str, *, convertible_extensions: set[str] | None = None) -> dict:
    """Delete a file inside *base_dir* after path-traversal validation.

    If *convertible_extensions* is provided and the file's extension matches,
    the companion ``.md`` file is also removed (if it exists).

    Args:
        base_dir: Directory containing the file.
        filename: Name of file to delete.
        convertible_extensions: Lowercase extensions (e.g. ``{".pdf", ".docx"}``)
            whose companion markdown should be cleaned up.

    Returns:
        Dict with success and message.

    Raises:
        FileNotFoundError: If the file does not exist.
        PathTraversalError: If path traversal is detected.
    """
    file_path = (base_dir / filename).resolve()
    validate_path_traversal(file_path, base_dir)

    if not file_path.is_file():
        raise FileNotFoundError(f"File not found: {filename}")

    file_path.unlink()

    # Clean up companion markdown generated during upload conversion.
    if convertible_extensions and file_path.suffix.lower() in convertible_extensions:
        file_path.with_suffix(".md").unlink(missing_ok=True)

    return {"success": True, "message": f"Deleted {filename}"}


def upload_artifact_url(thread_id: str, filename: str) -> str:
    """Build the artifact URL for a file in a thread's uploads directory.

    *filename* is percent-encoded so that spaces, ``#``, ``?`` etc. are safe.
    """
    return f"/api/threads/{thread_id}/artifacts{VIRTUAL_PATH_PREFIX}/uploads/{quote(filename, safe='')}"


def upload_virtual_path(filename: str) -> str:
    """Build the virtual path for a file in the uploads directory."""
    return f"{VIRTUAL_PATH_PREFIX}/uploads/{filename}"


def enrich_file_listing(result: dict, thread_id: str) -> dict:
    """Add virtual paths, artifact URLs, and stringify sizes on a listing result.

    Mutates *result* in place and returns it for convenience.
    """
    for f in result["files"]:
        filename = f["filename"]
        f["size"] = str(f["size"])
        f["virtual_path"] = upload_virtual_path(filename)
        f["artifact_url"] = upload_artifact_url(thread_id, filename)
    return result