* refactor: extract shared skill installer and upload manager to harness Move duplicated business logic from Gateway routers and Client into shared harness modules, eliminating code duplication. New shared modules: - deerflow.skills.installer: 6 functions (zip security, extraction, install) - deerflow.uploads.manager: 7 functions (normalize, deduplicate, validate, list, delete, get_uploads_dir, ensure_uploads_dir) Key improvements: - SkillAlreadyExistsError replaces stringly-typed 409 status routing - normalize_filename rejects backslash-containing filenames - Read paths (list/delete) no longer mkdir via get_uploads_dir - Write paths use ensure_uploads_dir for explicit directory creation - list_files_in_dir does stat inside scandir context (no re-stat) - install_skill_from_archive uses single is_file() check (one syscall) - Fix agent config key not reset on update_mcp_config/update_skill Tests: 42 new (22 installer + 20 upload manager) + client hardening * refactor: centralize upload URL construction and clean up installer - Extract upload_virtual_path(), upload_artifact_url(), enrich_file_listing() into shared manager.py, eliminating 6 duplicated URL constructions across Gateway router and Client - Derive all upload URLs from VIRTUAL_PATH_PREFIX constant instead of hardcoded "mnt/user-data/uploads" strings - Eliminate TOCTOU pre-checks and double file read in installer — single ZipFile() open with exception handling replaces is_file() + is_zipfile() + ZipFile() sequence - Add missing re-exports: ensure_uploads_dir in uploads/__init__.py, SkillAlreadyExistsError in skills/__init__.py - Remove redundant .lower() on already-lowercase CONVERTIBLE_EXTENSIONS - Hoist sandbox_uploads_dir(thread_id) before loop in uploads router * fix: add input validation for thread_id and filename length - Reject thread_id containing unsafe filesystem characters (only allow alphanumeric, hyphens, underscores, dots) — prevents 500 on inputs like <script> or shell metacharacters - Reject filenames longer than 255 bytes (OS limit) in normalize_filename - Gateway upload router maps ValueError to 400 for invalid thread_id * fix: address PR review — symlink safety, input validation coverage, error ordering - list_files_in_dir: use follow_symlinks=False to prevent symlink metadata leakage; check is_dir() instead of exists() for non-directory paths - install_skill_from_archive: restore is_file() pre-check before extension validation so error messages match the documented exception contract - validate_thread_id: move from ensure_uploads_dir to get_uploads_dir so all entry points (upload/list/delete) are protected - delete_uploaded_file: catch ValueError from thread_id validation (was 500) - requires_llm marker: also skip when OPENAI_API_KEY is unset - e2e fixture: update TitleMiddleware exclusion comment (kept filtering — middleware triggers extra LLM calls that add non-determinism to tests) * chore: revert uv.lock to main — no dependency changes in this PR * fix: use monkeypatch for global config in e2e fixture to prevent test pollution The e2e_env fixture was calling set_title_config() and set_summarization_config() directly, which mutated global singletons without automatic cleanup. When pytest ran test_client_e2e.py before test_title_middleware_core_logic.py, the leaked enabled=False caused 5 title tests to fail in CI. Switched to monkeypatch.setattr on the module-level private variables so pytest restores the originals after each test. * fix: address code review — URL encoding, API consistency, test isolation - upload_artifact_url: percent-encode filename to handle spaces/#/? - deduplicate_filename: mutate seen set in place (caller no longer needs manual .add() — less error-prone API) - list_files_in_dir: document that size is int, enrich stringifies - e2e fixture: monkeypatch _app_config instead of set_app_config() to prevent global singleton pollution (same pattern as title/summarization fix) - _make_e2e_config: read LLM connection details from env vars so external contributors can override defaults - Update tests to match new deduplicate_filename contract * docs: rewrite RFC in English and add alternatives/breaking changes sections * fix: address code review feedback on PR #1202 - Rename deduplicate_filename to claim_unique_filename to make the in-place set mutation explicit in the function name - Replace PermissionError with PathTraversalError(ValueError) for path traversal detection — malformed input is 400, not 403 * fix: set _app_config_is_custom in e2e test fixture to prevent config.yaml lookup in CI --------- Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
8.8 KiB
RFC: Extract Shared Skill Installer and Upload Manager into Harness
1. Problem
Gateway (app/gateway/routers/skills.py, uploads.py) and Client (deerflow/client.py) each independently implement the same business logic:
Skill Installation
| Logic | Gateway (skills.py) |
Client (client.py) |
|---|---|---|
| Zip safety check | _is_unsafe_zip_member() |
Inline Path(info.filename).is_absolute() |
| Symlink filtering | _is_symlink_member() |
p.is_symlink() post-extraction delete |
| Zip bomb defence | total_size += info.file_size (declared) |
total_size > 100MB (declared) |
| macOS metadata filter | _should_ignore_archive_entry() |
None |
| Frontmatter validation | _validate_skill_frontmatter() |
_validate_skill_frontmatter() |
| Duplicate detection | HTTPException(409) |
ValueError |
Two implementations, inconsistent behaviour: Gateway streams writes and tracks real decompressed size; Client sums declared file_size. Gateway skips symlinks during extraction; Client extracts everything then walks and deletes symlinks.
Upload Management
| Logic | Gateway (uploads.py) |
Client (client.py) |
|---|---|---|
| Directory access | get_uploads_dir() + mkdir |
_get_uploads_dir() + mkdir |
| Filename safety | Inline Path(f).name + manual checks |
No checks, uses src_path.name directly |
| Duplicate handling | None (overwrites) | None (overwrites) |
| Listing | Inline iterdir() |
Inline os.scandir() |
| Deletion | Inline unlink() + traversal check |
Inline unlink() + traversal check |
| Path traversal | resolve().relative_to() |
resolve().relative_to() |
The same traversal check is written twice — any security fix must be applied to both locations.
2. Design Principles
Dependency Direction
app.gateway.routers.skills ──┐
app.gateway.routers.uploads ──┤── calls ──→ deerflow.skills.installer
deerflow.client ──┘ deerflow.uploads.manager
- Shared modules live in the harness layer (
deerflow.*), pure business logic, no FastAPI dependency - Gateway handles HTTP adaptation (
UploadFile→ bytes, exceptions →HTTPException) - Client handles local adaptation (
Path→ copy, exceptions → Python exceptions) - Satisfies
test_harness_boundary.pyconstraint: harness never imports app
Exception Strategy
| Shared Layer Exception | Gateway Maps To | Client |
|---|---|---|
FileNotFoundError |
HTTPException(404) |
Propagates |
ValueError |
HTTPException(400) |
Propagates |
SkillAlreadyExistsError |
HTTPException(409) |
Propagates |
PermissionError |
HTTPException(403) |
Propagates |
Replaces stringly-typed routing ("already exists" in str(e)) with typed exception matching (SkillAlreadyExistsError).
3. New Modules
3.1 deerflow.skills.installer
# Safety checks
is_unsafe_zip_member(info: ZipInfo) -> bool # Absolute path / .. traversal
is_symlink_member(info: ZipInfo) -> bool # Unix symlink detection
should_ignore_archive_entry(path: Path) -> bool # __MACOSX / dotfiles
# Extraction
safe_extract_skill_archive(zip_ref, dest_path, max_total_size=512MB)
# Streaming write, accumulates real bytes (vs declared file_size)
# Dual traversal check: member-level + resolve-level
# Directory resolution
resolve_skill_dir_from_archive(temp_path: Path) -> Path
# Auto-enters single directory, filters macOS metadata
# Install entry point
install_skill_from_archive(zip_path, *, skills_root=None) -> dict
# is_file() pre-check before extension validation
# SkillAlreadyExistsError replaces ValueError
# Exception
class SkillAlreadyExistsError(ValueError)
3.2 deerflow.uploads.manager
# Directory management
get_uploads_dir(thread_id: str) -> Path # Pure path, no side effects
ensure_uploads_dir(thread_id: str) -> Path # Creates directory (for write paths)
# Filename safety
normalize_filename(filename: str) -> str
# Path.name extraction + rejects ".." / "." / backslash / >255 bytes
deduplicate_filename(name: str, seen: set) -> str
# _N suffix increment for dedup, mutates seen in place
# Path safety
validate_path_traversal(path: Path, base: Path) -> None
# resolve().relative_to(), raises PermissionError on failure
# File operations
list_files_in_dir(directory: Path) -> dict
# scandir with stat inside context (no re-stat)
# follow_symlinks=False to prevent metadata leakage
# Non-existent directory returns empty list
delete_file_safe(base_dir: Path, filename: str) -> dict
# Validates traversal first, then unlinks
# URL helpers
upload_artifact_url(thread_id, filename) -> str # Percent-encoded for HTTP safety
upload_virtual_path(filename) -> str # Sandbox-internal path
enrich_file_listing(result, thread_id) -> dict # Adds URLs, stringifies sizes
4. Changes
4.1 Gateway Slimming
app/gateway/routers/skills.py:
- Remove
_is_unsafe_zip_member,_is_symlink_member,_safe_extract_skill_archive,_should_ignore_archive_entry,_resolve_skill_dir_from_archive_root(~80 lines) install_skillroute becomes a single call toinstall_skill_from_archive(path)- Exception mapping:
SkillAlreadyExistsError → 409,ValueError → 400,FileNotFoundError → 404
app/gateway/routers/uploads.py:
- Remove inline
get_uploads_dir(replaced byensure_uploads_dir/get_uploads_dir) upload_filesusesnormalize_filename()instead of inline safety checkslist_uploaded_filesuseslist_files_in_dir()+ enrichmentdelete_uploaded_fileusesdelete_file_safe()+ companion markdown cleanup
4.2 Client Slimming
deerflow/client.py:
- Remove
_get_uploads_dirstatic method - Remove ~50 lines of inline zip handling in
install_skill install_skilldelegates toinstall_skill_from_archive()upload_filesusesdeduplicate_filename()+ensure_uploads_dir()list_uploadsusesget_uploads_dir()+list_files_in_dir()delete_uploadusesget_uploads_dir()+delete_file_safe()update_mcp_config/update_skillnow reset_agent_config_key = None
4.3 Read/Write Path Separation
| Operation | Function | Creates dir? |
|---|---|---|
| upload (write) | ensure_uploads_dir() |
Yes |
| list (read) | get_uploads_dir() |
No |
| delete (read) | get_uploads_dir() |
No |
Read paths no longer have mkdir side effects — non-existent directories return empty lists.
5. Security Improvements
| Improvement | Before | After |
|---|---|---|
| Zip bomb detection | Sum of declared file_size |
Streaming write, accumulates real bytes |
| Symlink handling | Gateway skips / Client deletes post-extract | Unified skip + log |
| Traversal check | Member-level only | Member-level + resolve().is_relative_to() |
| Filename backslash | Gateway checks / Client doesn't | Unified rejection |
| Filename length | No check | Reject > 255 bytes (OS limit) |
| thread_id validation | None | Reject unsafe filesystem characters |
| Listing symlink leak | follow_symlinks=True (default) |
follow_symlinks=False |
| 409 status routing | "already exists" in str(e) |
SkillAlreadyExistsError type match |
| Artifact URL encoding | Raw filename in URL | urllib.parse.quote() |
6. Alternatives Considered
| Alternative | Why Not |
|---|---|
| Keep logic in Gateway, Client calls Gateway via HTTP | Adds network dependency to embedded Client; defeats the purpose of DeerFlowClient as an in-process API |
| Abstract base class with Gateway/Client subclasses | Over-engineered for what are pure functions; no polymorphism needed |
Move everything into client.py and have Gateway import it |
Violates harness/app boundary — Client is in harness, but Gateway-specific models (Pydantic response types) should stay in app layer |
| Merge Gateway and Client into one module | They serve different consumers (HTTP vs in-process) with different adaptation needs |
7. Breaking Changes
None. All public APIs (Gateway HTTP endpoints, DeerFlowClient methods) retain their existing signatures and return formats. The SkillAlreadyExistsError is a subclass of ValueError, so existing except ValueError handlers still catch it.
8. Tests
| Module | Test File | Count |
|---|---|---|
skills.installer |
tests/test_skills_installer.py |
22 |
uploads.manager |
tests/test_uploads_manager.py |
20 |
client hardening |
tests/test_client.py (new cases) |
~40 |
client e2e |
tests/test_client_e2e.py (new file) |
~20 |
Coverage: unsafe zip / symlink / zip bomb / frontmatter / duplicate / extension / macOS filter / normalize / deduplicate / traversal / list / delete / agent invalidation / upload lifecycle / thread isolation / URL encoding / config pollution.