Files
deer-flow/backend/docs/rfc-extract-shared-modules.md
greatmengqi b8bc80d89b refactor: extract shared skill installer and upload manager to harness (#1202)
* refactor: extract shared skill installer and upload manager to harness

Move duplicated business logic from Gateway routers and Client into
shared harness modules, eliminating code duplication.

New shared modules:
- deerflow.skills.installer: 6 functions (zip security, extraction, install)
- deerflow.uploads.manager: 7 functions (normalize, deduplicate, validate,
  list, delete, get_uploads_dir, ensure_uploads_dir)

Key improvements:
- SkillAlreadyExistsError replaces stringly-typed 409 status routing
- normalize_filename rejects backslash-containing filenames
- Read paths (list/delete) no longer mkdir via get_uploads_dir
- Write paths use ensure_uploads_dir for explicit directory creation
- list_files_in_dir does stat inside scandir context (no re-stat)
- install_skill_from_archive uses single is_file() check (one syscall)
- Fix agent config key not reset on update_mcp_config/update_skill

Tests: 42 new (22 installer + 20 upload manager) + client hardening

* refactor: centralize upload URL construction and clean up installer

- Extract upload_virtual_path(), upload_artifact_url(), enrich_file_listing()
  into shared manager.py, eliminating 6 duplicated URL constructions across
  Gateway router and Client
- Derive all upload URLs from VIRTUAL_PATH_PREFIX constant instead of
  hardcoded "mnt/user-data/uploads" strings
- Eliminate TOCTOU pre-checks and double file read in installer — single
  ZipFile() open with exception handling replaces is_file() + is_zipfile()
  + ZipFile() sequence
- Add missing re-exports: ensure_uploads_dir in uploads/__init__.py,
  SkillAlreadyExistsError in skills/__init__.py
- Remove redundant .lower() on already-lowercase CONVERTIBLE_EXTENSIONS
- Hoist sandbox_uploads_dir(thread_id) before loop in uploads router

* fix: add input validation for thread_id and filename length

- Reject thread_id containing unsafe filesystem characters (only allow
  alphanumeric, hyphens, underscores, dots) — prevents 500 on inputs
  like <script> or shell metacharacters
- Reject filenames longer than 255 bytes (OS limit) in normalize_filename
- Gateway upload router maps ValueError to 400 for invalid thread_id

* fix: address PR review — symlink safety, input validation coverage, error ordering

- list_files_in_dir: use follow_symlinks=False to prevent symlink metadata
  leakage; check is_dir() instead of exists() for non-directory paths
- install_skill_from_archive: restore is_file() pre-check before extension
  validation so error messages match the documented exception contract
- validate_thread_id: move from ensure_uploads_dir to get_uploads_dir so
  all entry points (upload/list/delete) are protected
- delete_uploaded_file: catch ValueError from thread_id validation (was 500)
- requires_llm marker: also skip when OPENAI_API_KEY is unset
- e2e fixture: update TitleMiddleware exclusion comment (kept filtering —
  middleware triggers extra LLM calls that add non-determinism to tests)

* chore: revert uv.lock to main — no dependency changes in this PR

* fix: use monkeypatch for global config in e2e fixture to prevent test pollution

The e2e_env fixture was calling set_title_config() and
set_summarization_config() directly, which mutated global singletons
without automatic cleanup. When pytest ran test_client_e2e.py before
test_title_middleware_core_logic.py, the leaked enabled=False caused
5 title tests to fail in CI.

Switched to monkeypatch.setattr on the module-level private variables
so pytest restores the originals after each test.

* fix: address code review — URL encoding, API consistency, test isolation

- upload_artifact_url: percent-encode filename to handle spaces/#/?
- deduplicate_filename: mutate seen set in place (caller no longer
  needs manual .add() — less error-prone API)
- list_files_in_dir: document that size is int, enrich stringifies
- e2e fixture: monkeypatch _app_config instead of set_app_config()
  to prevent global singleton pollution (same pattern as title/summarization fix)
- _make_e2e_config: read LLM connection details from env vars so
  external contributors can override defaults
- Update tests to match new deduplicate_filename contract

* docs: rewrite RFC in English and add alternatives/breaking changes sections

* fix: address code review feedback on PR #1202

- Rename deduplicate_filename to claim_unique_filename to make
  the in-place set mutation explicit in the function name
- Replace PermissionError with PathTraversalError(ValueError) for
  path traversal detection — malformed input is 400, not 403

* fix: set _app_config_is_custom in e2e test fixture to prevent config.yaml lookup in CI

---------

Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
2026-03-25 16:28:33 +08:00

8.8 KiB

RFC: Extract Shared Skill Installer and Upload Manager into Harness

1. Problem

Gateway (app/gateway/routers/skills.py, uploads.py) and Client (deerflow/client.py) each independently implement the same business logic:

Skill Installation

Logic Gateway (skills.py) Client (client.py)
Zip safety check _is_unsafe_zip_member() Inline Path(info.filename).is_absolute()
Symlink filtering _is_symlink_member() p.is_symlink() post-extraction delete
Zip bomb defence total_size += info.file_size (declared) total_size > 100MB (declared)
macOS metadata filter _should_ignore_archive_entry() None
Frontmatter validation _validate_skill_frontmatter() _validate_skill_frontmatter()
Duplicate detection HTTPException(409) ValueError

Two implementations, inconsistent behaviour: Gateway streams writes and tracks real decompressed size; Client sums declared file_size. Gateway skips symlinks during extraction; Client extracts everything then walks and deletes symlinks.

Upload Management

Logic Gateway (uploads.py) Client (client.py)
Directory access get_uploads_dir() + mkdir _get_uploads_dir() + mkdir
Filename safety Inline Path(f).name + manual checks No checks, uses src_path.name directly
Duplicate handling None (overwrites) None (overwrites)
Listing Inline iterdir() Inline os.scandir()
Deletion Inline unlink() + traversal check Inline unlink() + traversal check
Path traversal resolve().relative_to() resolve().relative_to()

The same traversal check is written twice — any security fix must be applied to both locations.

2. Design Principles

Dependency Direction

app.gateway.routers.skills  ──┐
app.gateway.routers.uploads ──┤── calls ──→  deerflow.skills.installer
deerflow.client             ──┘              deerflow.uploads.manager
  • Shared modules live in the harness layer (deerflow.*), pure business logic, no FastAPI dependency
  • Gateway handles HTTP adaptation (UploadFile → bytes, exceptions → HTTPException)
  • Client handles local adaptation (Path → copy, exceptions → Python exceptions)
  • Satisfies test_harness_boundary.py constraint: harness never imports app

Exception Strategy

Shared Layer Exception Gateway Maps To Client
FileNotFoundError HTTPException(404) Propagates
ValueError HTTPException(400) Propagates
SkillAlreadyExistsError HTTPException(409) Propagates
PermissionError HTTPException(403) Propagates

Replaces stringly-typed routing ("already exists" in str(e)) with typed exception matching (SkillAlreadyExistsError).

3. New Modules

3.1 deerflow.skills.installer

# Safety checks
is_unsafe_zip_member(info: ZipInfo) -> bool     # Absolute path / .. traversal
is_symlink_member(info: ZipInfo) -> bool         # Unix symlink detection
should_ignore_archive_entry(path: Path) -> bool  # __MACOSX / dotfiles

# Extraction
safe_extract_skill_archive(zip_ref, dest_path, max_total_size=512MB)
  # Streaming write, accumulates real bytes (vs declared file_size)
  # Dual traversal check: member-level + resolve-level

# Directory resolution
resolve_skill_dir_from_archive(temp_path: Path) -> Path
  # Auto-enters single directory, filters macOS metadata

# Install entry point
install_skill_from_archive(zip_path, *, skills_root=None) -> dict
  # is_file() pre-check before extension validation
  # SkillAlreadyExistsError replaces ValueError

# Exception
class SkillAlreadyExistsError(ValueError)

3.2 deerflow.uploads.manager

# Directory management
get_uploads_dir(thread_id: str) -> Path      # Pure path, no side effects
ensure_uploads_dir(thread_id: str) -> Path   # Creates directory (for write paths)

# Filename safety
normalize_filename(filename: str) -> str
  # Path.name extraction + rejects ".." / "." / backslash / >255 bytes
deduplicate_filename(name: str, seen: set) -> str
  # _N suffix increment for dedup, mutates seen in place

# Path safety
validate_path_traversal(path: Path, base: Path) -> None
  # resolve().relative_to(), raises PermissionError on failure

# File operations
list_files_in_dir(directory: Path) -> dict
  # scandir with stat inside context (no re-stat)
  # follow_symlinks=False to prevent metadata leakage
  # Non-existent directory returns empty list
delete_file_safe(base_dir: Path, filename: str) -> dict
  # Validates traversal first, then unlinks

# URL helpers
upload_artifact_url(thread_id, filename) -> str   # Percent-encoded for HTTP safety
upload_virtual_path(filename) -> str               # Sandbox-internal path
enrich_file_listing(result, thread_id) -> dict     # Adds URLs, stringifies sizes

4. Changes

4.1 Gateway Slimming

app/gateway/routers/skills.py:

  • Remove _is_unsafe_zip_member, _is_symlink_member, _safe_extract_skill_archive, _should_ignore_archive_entry, _resolve_skill_dir_from_archive_root (~80 lines)
  • install_skill route becomes a single call to install_skill_from_archive(path)
  • Exception mapping: SkillAlreadyExistsError → 409, ValueError → 400, FileNotFoundError → 404

app/gateway/routers/uploads.py:

  • Remove inline get_uploads_dir (replaced by ensure_uploads_dir/get_uploads_dir)
  • upload_files uses normalize_filename() instead of inline safety checks
  • list_uploaded_files uses list_files_in_dir() + enrichment
  • delete_uploaded_file uses delete_file_safe() + companion markdown cleanup

4.2 Client Slimming

deerflow/client.py:

  • Remove _get_uploads_dir static method
  • Remove ~50 lines of inline zip handling in install_skill
  • install_skill delegates to install_skill_from_archive()
  • upload_files uses deduplicate_filename() + ensure_uploads_dir()
  • list_uploads uses get_uploads_dir() + list_files_in_dir()
  • delete_upload uses get_uploads_dir() + delete_file_safe()
  • update_mcp_config / update_skill now reset _agent_config_key = None

4.3 Read/Write Path Separation

Operation Function Creates dir?
upload (write) ensure_uploads_dir() Yes
list (read) get_uploads_dir() No
delete (read) get_uploads_dir() No

Read paths no longer have mkdir side effects — non-existent directories return empty lists.

5. Security Improvements

Improvement Before After
Zip bomb detection Sum of declared file_size Streaming write, accumulates real bytes
Symlink handling Gateway skips / Client deletes post-extract Unified skip + log
Traversal check Member-level only Member-level + resolve().is_relative_to()
Filename backslash Gateway checks / Client doesn't Unified rejection
Filename length No check Reject > 255 bytes (OS limit)
thread_id validation None Reject unsafe filesystem characters
Listing symlink leak follow_symlinks=True (default) follow_symlinks=False
409 status routing "already exists" in str(e) SkillAlreadyExistsError type match
Artifact URL encoding Raw filename in URL urllib.parse.quote()

6. Alternatives Considered

Alternative Why Not
Keep logic in Gateway, Client calls Gateway via HTTP Adds network dependency to embedded Client; defeats the purpose of DeerFlowClient as an in-process API
Abstract base class with Gateway/Client subclasses Over-engineered for what are pure functions; no polymorphism needed
Move everything into client.py and have Gateway import it Violates harness/app boundary — Client is in harness, but Gateway-specific models (Pydantic response types) should stay in app layer
Merge Gateway and Client into one module They serve different consumers (HTTP vs in-process) with different adaptation needs

7. Breaking Changes

None. All public APIs (Gateway HTTP endpoints, DeerFlowClient methods) retain their existing signatures and return formats. The SkillAlreadyExistsError is a subclass of ValueError, so existing except ValueError handlers still catch it.

8. Tests

Module Test File Count
skills.installer tests/test_skills_installer.py 22
uploads.manager tests/test_uploads_manager.py 20
client hardening tests/test_client.py (new cases) ~40
client e2e tests/test_client_e2e.py (new file) ~20

Coverage: unsafe zip / symlink / zip bomb / frontmatter / duplicate / extension / macOS filter / normalize / deduplicate / traversal / list / delete / agent invalidation / upload lifecycle / thread isolation / URL encoding / config pollution.