mirror of
https://gitee.com/wanwujie/deer-flow
synced 2026-04-03 06:12:14 +08:00
* refactor: extract shared skill installer and upload manager to harness Move duplicated business logic from Gateway routers and Client into shared harness modules, eliminating code duplication. New shared modules: - deerflow.skills.installer: 6 functions (zip security, extraction, install) - deerflow.uploads.manager: 7 functions (normalize, deduplicate, validate, list, delete, get_uploads_dir, ensure_uploads_dir) Key improvements: - SkillAlreadyExistsError replaces stringly-typed 409 status routing - normalize_filename rejects backslash-containing filenames - Read paths (list/delete) no longer mkdir via get_uploads_dir - Write paths use ensure_uploads_dir for explicit directory creation - list_files_in_dir does stat inside scandir context (no re-stat) - install_skill_from_archive uses single is_file() check (one syscall) - Fix agent config key not reset on update_mcp_config/update_skill Tests: 42 new (22 installer + 20 upload manager) + client hardening * refactor: centralize upload URL construction and clean up installer - Extract upload_virtual_path(), upload_artifact_url(), enrich_file_listing() into shared manager.py, eliminating 6 duplicated URL constructions across Gateway router and Client - Derive all upload URLs from VIRTUAL_PATH_PREFIX constant instead of hardcoded "mnt/user-data/uploads" strings - Eliminate TOCTOU pre-checks and double file read in installer — single ZipFile() open with exception handling replaces is_file() + is_zipfile() + ZipFile() sequence - Add missing re-exports: ensure_uploads_dir in uploads/__init__.py, SkillAlreadyExistsError in skills/__init__.py - Remove redundant .lower() on already-lowercase CONVERTIBLE_EXTENSIONS - Hoist sandbox_uploads_dir(thread_id) before loop in uploads router * fix: add input validation for thread_id and filename length - Reject thread_id containing unsafe filesystem characters (only allow alphanumeric, hyphens, underscores, dots) — prevents 500 on inputs like <script> or shell metacharacters - Reject filenames longer than 255 bytes (OS limit) in normalize_filename - Gateway upload router maps ValueError to 400 for invalid thread_id * fix: address PR review — symlink safety, input validation coverage, error ordering - list_files_in_dir: use follow_symlinks=False to prevent symlink metadata leakage; check is_dir() instead of exists() for non-directory paths - install_skill_from_archive: restore is_file() pre-check before extension validation so error messages match the documented exception contract - validate_thread_id: move from ensure_uploads_dir to get_uploads_dir so all entry points (upload/list/delete) are protected - delete_uploaded_file: catch ValueError from thread_id validation (was 500) - requires_llm marker: also skip when OPENAI_API_KEY is unset - e2e fixture: update TitleMiddleware exclusion comment (kept filtering — middleware triggers extra LLM calls that add non-determinism to tests) * chore: revert uv.lock to main — no dependency changes in this PR * fix: use monkeypatch for global config in e2e fixture to prevent test pollution The e2e_env fixture was calling set_title_config() and set_summarization_config() directly, which mutated global singletons without automatic cleanup. When pytest ran test_client_e2e.py before test_title_middleware_core_logic.py, the leaked enabled=False caused 5 title tests to fail in CI. Switched to monkeypatch.setattr on the module-level private variables so pytest restores the originals after each test. * fix: address code review — URL encoding, API consistency, test isolation - upload_artifact_url: percent-encode filename to handle spaces/#/? - deduplicate_filename: mutate seen set in place (caller no longer needs manual .add() — less error-prone API) - list_files_in_dir: document that size is int, enrich stringifies - e2e fixture: monkeypatch _app_config instead of set_app_config() to prevent global singleton pollution (same pattern as title/summarization fix) - _make_e2e_config: read LLM connection details from env vars so external contributors can override defaults - Update tests to match new deduplicate_filename contract * docs: rewrite RFC in English and add alternatives/breaking changes sections * fix: address code review feedback on PR #1202 - Rename deduplicate_filename to claim_unique_filename to make the in-place set mutation explicit in the function name - Replace PermissionError with PathTraversalError(ValueError) for path traversal detection — malformed input is 400, not 403 * fix: set _app_config_is_custom in e2e test fixture to prevent config.yaml lookup in CI --------- Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
191 lines
8.8 KiB
Markdown
191 lines
8.8 KiB
Markdown
# RFC: Extract Shared Skill Installer and Upload Manager into Harness
|
|
|
|
## 1. Problem
|
|
|
|
Gateway (`app/gateway/routers/skills.py`, `uploads.py`) and Client (`deerflow/client.py`) each independently implement the same business logic:
|
|
|
|
### Skill Installation
|
|
|
|
| Logic | Gateway (`skills.py`) | Client (`client.py`) |
|
|
|-------|----------------------|---------------------|
|
|
| Zip safety check | `_is_unsafe_zip_member()` | Inline `Path(info.filename).is_absolute()` |
|
|
| Symlink filtering | `_is_symlink_member()` | `p.is_symlink()` post-extraction delete |
|
|
| Zip bomb defence | `total_size += info.file_size` (declared) | `total_size > 100MB` (declared) |
|
|
| macOS metadata filter | `_should_ignore_archive_entry()` | None |
|
|
| Frontmatter validation | `_validate_skill_frontmatter()` | `_validate_skill_frontmatter()` |
|
|
| Duplicate detection | `HTTPException(409)` | `ValueError` |
|
|
|
|
**Two implementations, inconsistent behaviour**: Gateway streams writes and tracks real decompressed size; Client sums declared `file_size`. Gateway skips symlinks during extraction; Client extracts everything then walks and deletes symlinks.
|
|
|
|
### Upload Management
|
|
|
|
| Logic | Gateway (`uploads.py`) | Client (`client.py`) |
|
|
|-------|----------------------|---------------------|
|
|
| Directory access | `get_uploads_dir()` + `mkdir` | `_get_uploads_dir()` + `mkdir` |
|
|
| Filename safety | Inline `Path(f).name` + manual checks | No checks, uses `src_path.name` directly |
|
|
| Duplicate handling | None (overwrites) | None (overwrites) |
|
|
| Listing | Inline `iterdir()` | Inline `os.scandir()` |
|
|
| Deletion | Inline `unlink()` + traversal check | Inline `unlink()` + traversal check |
|
|
| Path traversal | `resolve().relative_to()` | `resolve().relative_to()` |
|
|
|
|
**The same traversal check is written twice** — any security fix must be applied to both locations.
|
|
|
|
## 2. Design Principles
|
|
|
|
### Dependency Direction
|
|
|
|
```
|
|
app.gateway.routers.skills ──┐
|
|
app.gateway.routers.uploads ──┤── calls ──→ deerflow.skills.installer
|
|
deerflow.client ──┘ deerflow.uploads.manager
|
|
```
|
|
|
|
- Shared modules live in the harness layer (`deerflow.*`), pure business logic, no FastAPI dependency
|
|
- Gateway handles HTTP adaptation (`UploadFile` → bytes, exceptions → `HTTPException`)
|
|
- Client handles local adaptation (`Path` → copy, exceptions → Python exceptions)
|
|
- Satisfies `test_harness_boundary.py` constraint: harness never imports app
|
|
|
|
### Exception Strategy
|
|
|
|
| Shared Layer Exception | Gateway Maps To | Client |
|
|
|----------------------|-----------------|--------|
|
|
| `FileNotFoundError` | `HTTPException(404)` | Propagates |
|
|
| `ValueError` | `HTTPException(400)` | Propagates |
|
|
| `SkillAlreadyExistsError` | `HTTPException(409)` | Propagates |
|
|
| `PermissionError` | `HTTPException(403)` | Propagates |
|
|
|
|
Replaces stringly-typed routing (`"already exists" in str(e)`) with typed exception matching (`SkillAlreadyExistsError`).
|
|
|
|
## 3. New Modules
|
|
|
|
### 3.1 `deerflow.skills.installer`
|
|
|
|
```python
|
|
# Safety checks
|
|
is_unsafe_zip_member(info: ZipInfo) -> bool # Absolute path / .. traversal
|
|
is_symlink_member(info: ZipInfo) -> bool # Unix symlink detection
|
|
should_ignore_archive_entry(path: Path) -> bool # __MACOSX / dotfiles
|
|
|
|
# Extraction
|
|
safe_extract_skill_archive(zip_ref, dest_path, max_total_size=512MB)
|
|
# Streaming write, accumulates real bytes (vs declared file_size)
|
|
# Dual traversal check: member-level + resolve-level
|
|
|
|
# Directory resolution
|
|
resolve_skill_dir_from_archive(temp_path: Path) -> Path
|
|
# Auto-enters single directory, filters macOS metadata
|
|
|
|
# Install entry point
|
|
install_skill_from_archive(zip_path, *, skills_root=None) -> dict
|
|
# is_file() pre-check before extension validation
|
|
# SkillAlreadyExistsError replaces ValueError
|
|
|
|
# Exception
|
|
class SkillAlreadyExistsError(ValueError)
|
|
```
|
|
|
|
### 3.2 `deerflow.uploads.manager`
|
|
|
|
```python
|
|
# Directory management
|
|
get_uploads_dir(thread_id: str) -> Path # Pure path, no side effects
|
|
ensure_uploads_dir(thread_id: str) -> Path # Creates directory (for write paths)
|
|
|
|
# Filename safety
|
|
normalize_filename(filename: str) -> str
|
|
# Path.name extraction + rejects ".." / "." / backslash / >255 bytes
|
|
deduplicate_filename(name: str, seen: set) -> str
|
|
# _N suffix increment for dedup, mutates seen in place
|
|
|
|
# Path safety
|
|
validate_path_traversal(path: Path, base: Path) -> None
|
|
# resolve().relative_to(), raises PermissionError on failure
|
|
|
|
# File operations
|
|
list_files_in_dir(directory: Path) -> dict
|
|
# scandir with stat inside context (no re-stat)
|
|
# follow_symlinks=False to prevent metadata leakage
|
|
# Non-existent directory returns empty list
|
|
delete_file_safe(base_dir: Path, filename: str) -> dict
|
|
# Validates traversal first, then unlinks
|
|
|
|
# URL helpers
|
|
upload_artifact_url(thread_id, filename) -> str # Percent-encoded for HTTP safety
|
|
upload_virtual_path(filename) -> str # Sandbox-internal path
|
|
enrich_file_listing(result, thread_id) -> dict # Adds URLs, stringifies sizes
|
|
```
|
|
|
|
## 4. Changes
|
|
|
|
### 4.1 Gateway Slimming
|
|
|
|
**`app/gateway/routers/skills.py`**:
|
|
- Remove `_is_unsafe_zip_member`, `_is_symlink_member`, `_safe_extract_skill_archive`, `_should_ignore_archive_entry`, `_resolve_skill_dir_from_archive_root` (~80 lines)
|
|
- `install_skill` route becomes a single call to `install_skill_from_archive(path)`
|
|
- Exception mapping: `SkillAlreadyExistsError → 409`, `ValueError → 400`, `FileNotFoundError → 404`
|
|
|
|
**`app/gateway/routers/uploads.py`**:
|
|
- Remove inline `get_uploads_dir` (replaced by `ensure_uploads_dir`/`get_uploads_dir`)
|
|
- `upload_files` uses `normalize_filename()` instead of inline safety checks
|
|
- `list_uploaded_files` uses `list_files_in_dir()` + enrichment
|
|
- `delete_uploaded_file` uses `delete_file_safe()` + companion markdown cleanup
|
|
|
|
### 4.2 Client Slimming
|
|
|
|
**`deerflow/client.py`**:
|
|
- Remove `_get_uploads_dir` static method
|
|
- Remove ~50 lines of inline zip handling in `install_skill`
|
|
- `install_skill` delegates to `install_skill_from_archive()`
|
|
- `upload_files` uses `deduplicate_filename()` + `ensure_uploads_dir()`
|
|
- `list_uploads` uses `get_uploads_dir()` + `list_files_in_dir()`
|
|
- `delete_upload` uses `get_uploads_dir()` + `delete_file_safe()`
|
|
- `update_mcp_config` / `update_skill` now reset `_agent_config_key = None`
|
|
|
|
### 4.3 Read/Write Path Separation
|
|
|
|
| Operation | Function | Creates dir? |
|
|
|-----------|----------|:------------:|
|
|
| upload (write) | `ensure_uploads_dir()` | Yes |
|
|
| list (read) | `get_uploads_dir()` | No |
|
|
| delete (read) | `get_uploads_dir()` | No |
|
|
|
|
Read paths no longer have `mkdir` side effects — non-existent directories return empty lists.
|
|
|
|
## 5. Security Improvements
|
|
|
|
| Improvement | Before | After |
|
|
|-------------|--------|-------|
|
|
| Zip bomb detection | Sum of declared `file_size` | Streaming write, accumulates real bytes |
|
|
| Symlink handling | Gateway skips / Client deletes post-extract | Unified skip + log |
|
|
| Traversal check | Member-level only | Member-level + `resolve().is_relative_to()` |
|
|
| Filename backslash | Gateway checks / Client doesn't | Unified rejection |
|
|
| Filename length | No check | Reject > 255 bytes (OS limit) |
|
|
| thread_id validation | None | Reject unsafe filesystem characters |
|
|
| Listing symlink leak | `follow_symlinks=True` (default) | `follow_symlinks=False` |
|
|
| 409 status routing | `"already exists" in str(e)` | `SkillAlreadyExistsError` type match |
|
|
| Artifact URL encoding | Raw filename in URL | `urllib.parse.quote()` |
|
|
|
|
## 6. Alternatives Considered
|
|
|
|
| Alternative | Why Not |
|
|
|-------------|---------|
|
|
| Keep logic in Gateway, Client calls Gateway via HTTP | Adds network dependency to embedded Client; defeats the purpose of `DeerFlowClient` as an in-process API |
|
|
| Abstract base class with Gateway/Client subclasses | Over-engineered for what are pure functions; no polymorphism needed |
|
|
| Move everything into `client.py` and have Gateway import it | Violates harness/app boundary — Client is in harness, but Gateway-specific models (Pydantic response types) should stay in app layer |
|
|
| Merge Gateway and Client into one module | They serve different consumers (HTTP vs in-process) with different adaptation needs |
|
|
|
|
## 7. Breaking Changes
|
|
|
|
**None.** All public APIs (Gateway HTTP endpoints, `DeerFlowClient` methods) retain their existing signatures and return formats. The `SkillAlreadyExistsError` is a subclass of `ValueError`, so existing `except ValueError` handlers still catch it.
|
|
|
|
## 8. Tests
|
|
|
|
| Module | Test File | Count |
|
|
|--------|-----------|:-----:|
|
|
| `skills.installer` | `tests/test_skills_installer.py` | 22 |
|
|
| `uploads.manager` | `tests/test_uploads_manager.py` | 20 |
|
|
| `client` hardening | `tests/test_client.py` (new cases) | ~40 |
|
|
| `client` e2e | `tests/test_client_e2e.py` (new file) | ~20 |
|
|
|
|
Coverage: unsafe zip / symlink / zip bomb / frontmatter / duplicate / extension / macOS filter / normalize / deduplicate / traversal / list / delete / agent invalidation / upload lifecycle / thread isolation / URL encoding / config pollution.
|