feat: add IM channels for Feishu, Slack, and Telegram (#1010)
* feat: add IM channels system for Feishu, Slack, and Telegram integration

  Bridge external messaging platforms to DeerFlow via LangGraph Server with an async message bus, thread management, and per-channel configuration.

* fix: address review comments on IM channels system

  Fix topic_id handling in store remove/list_entries and manager commands, correct Telegram reply threading, remove unused imports/variables, update docstrings and docs to match the implementation, and prevent config mutation.

* update skill creator
* fix im reply text
* fix comments

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@@ -15,4 +15,10 @@ INFOQUEST_API_KEY=your-infoquest-api-key
# OPENAI_API_KEY=your-openai-api-key
# GEMINI_API_KEY=your-gemini-api-key
# DEEPSEEK_API_KEY=your-deepseek-api-key
# NOVITA_API_KEY=your-novita-api-key # OpenAI-compatible, see https://novita.ai
# FEISHU_APP_ID=your-feishu-app-id
# FEISHU_APP_SECRET=your-feishu-app-secret

# SLACK_BOT_TOKEN=your-slack-bot-token
# SLACK_APP_TOKEN=your-slack-app-token
# TELEGRAM_BOT_TOKEN=your-telegram-bot-token
86
README.md
@@ -41,6 +41,7 @@ DeerFlow has newly integrated the intelligent search and crawling toolset indepe
- [Advanced](#advanced)
  - [Sandbox Mode](#sandbox-mode)
  - [MCP Server](#mcp-server)
  - [IM Channels](#im-channels)
- [From Deep Research to Super Agent Harness](#from-deep-research-to-super-agent-harness)
- [Core Features](#core-features)
  - [Skills \& Tools](#skills--tools)
@@ -184,6 +185,91 @@ DeerFlow supports configurable MCP servers and skills to extend its capabilities
For HTTP/SSE MCP servers, OAuth token flows are supported (`client_credentials`, `refresh_token`).
See the [MCP Server Guide](backend/docs/MCP_SERVER.md) for detailed instructions.

#### IM Channels

DeerFlow supports receiving tasks from messaging apps. Channels auto-start when configured — no public IP required for any of them.

| Channel | Transport | Difficulty |
|---------|-----------|------------|
| Telegram | Bot API (long-polling) | Easy |
| Slack | Socket Mode | Moderate |
| Feishu / Lark | WebSocket | Moderate |

**Configuration in `config.yaml`:**

```yaml
channels:
  # LangGraph Server URL (default: http://localhost:2024)
  langgraph_url: http://localhost:2024
  # Gateway API URL (default: http://localhost:8001)
  gateway_url: http://localhost:8001

  feishu:
    enabled: true
    app_id: $FEISHU_APP_ID
    app_secret: $FEISHU_APP_SECRET

  slack:
    enabled: true
    bot_token: $SLACK_BOT_TOKEN   # xoxb-...
    app_token: $SLACK_APP_TOKEN   # xapp-... (Socket Mode)
    allowed_users: []             # empty = allow all

  telegram:
    enabled: true
    bot_token: $TELEGRAM_BOT_TOKEN
    allowed_users: []             # empty = allow all
```

Set the corresponding API keys in your `.env` file:

```bash
# Telegram
TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrSTUvwxYZ

# Slack
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...

# Feishu / Lark
FEISHU_APP_ID=cli_xxxx
FEISHU_APP_SECRET=your_app_secret
```

**Telegram Setup**

1. Chat with [@BotFather](https://t.me/BotFather), send `/newbot`, and copy the HTTP API token.
2. Set `TELEGRAM_BOT_TOKEN` in `.env` and enable the channel in `config.yaml`.

**Slack Setup**

1. Create a Slack App at [api.slack.com/apps](https://api.slack.com/apps) → Create New App → From scratch.
2. Under **OAuth & Permissions**, add Bot Token Scopes: `app_mentions:read`, `chat:write`, `im:history`, `im:read`, `im:write`.
3. Enable **Socket Mode** → generate an App-Level Token (`xapp-…`) with `connections:write` scope.
4. Under **Event Subscriptions**, subscribe to bot events: `app_mention`, `message.im`.
5. Set `SLACK_BOT_TOKEN` and `SLACK_APP_TOKEN` in `.env` and enable the channel in `config.yaml`.

**Feishu / Lark Setup**

1. Create an app on [Feishu Open Platform](https://open.feishu.cn/) → enable **Bot** capability.
2. Add permissions: `im:message`, `im:resource`.
3. Under **Events**, subscribe to `im.message.receive_v1` and select **Long Connection** mode.
4. Copy the App ID and App Secret. Set `FEISHU_APP_ID` and `FEISHU_APP_SECRET` in `.env` and enable the channel in `config.yaml`.

**Commands**

Once a channel is connected, you can interact with DeerFlow directly from the chat:

| Command | Description |
|---------|-------------|
| `/new` | Start a new conversation |
| `/status` | Show current thread info |
| `/models` | List available models |
| `/memory` | View memory |
| `/help` | Show help |

> Messages without a command prefix are treated as regular chat — DeerFlow creates a thread and responds conversationally.

## From Deep Research to Super Agent Harness

DeerFlow started as a Deep Research framework — and the community ran with it. Since launch, developers have pushed it far beyond research: building data pipelines, generating slide decks, spinning up dashboards, automating content workflows. Things we never anticipated.
@@ -243,6 +243,32 @@ Proxied through nginx: `/api/langgraph/*` → LangGraph, all other `/api/*` →
- Config values starting with `$` resolved as environment variables
- Missing provider modules surface actionable install hints from reflection resolvers (for example `uv add langchain-google-genai`)

### IM Channels System (`src/channels/`)

Bridges external messaging platforms (Feishu, Slack, Telegram) to the DeerFlow agent via the LangGraph Server.

**Architecture**: Channels communicate with the LangGraph Server through the `langgraph-sdk` HTTP client (same as the frontend), ensuring threads are created and managed server-side.

**Components** (a usage sketch follows this list):
- `message_bus.py` - Async pub/sub hub (`InboundMessage` → queue → dispatcher; `OutboundMessage` → callbacks → channels)
- `store.py` - JSON-file persistence mapping `channel_name:chat_id[:topic_id]` → `thread_id` (keys are `channel:chat` for root conversations and `channel:chat:topic` for threaded conversations)
- `manager.py` - Core dispatcher: creates threads via `client.threads.create()`, sends messages via `client.runs.wait()`, routes commands
- `base.py` - Abstract `Channel` base class (start/stop/send lifecycle)
- `service.py` - Manages lifecycle of all configured channels from `config.yaml`
- `slack.py` / `feishu.py` / `telegram.py` - Platform-specific implementations
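
The bus surface is small. The following minimal usage sketch (illustrative only, not part of the commit) shows how a channel and the dispatcher meet on the bus, using only the types and methods defined in `message_bus.py`:

```python
import asyncio

from src.channels.message_bus import InboundMessage, MessageBus, OutboundMessage


async def main() -> None:
    bus = MessageBus()

    # A channel registers an async callback to receive replies.
    async def print_reply(msg: OutboundMessage) -> None:
        print(f"[{msg.channel_name}] -> chat {msg.chat_id}: {msg.text}")

    bus.subscribe_outbound(print_reply)

    # The channel publishes what the user typed...
    await bus.publish_inbound(
        InboundMessage(channel_name="telegram", chat_id="42", user_id="u1", text="hello")
    )

    # ...and the dispatcher consumes it and answers.
    inbound = await bus.get_inbound()
    await bus.publish_outbound(
        OutboundMessage(
            channel_name=inbound.channel_name,
            chat_id=inbound.chat_id,
            thread_id="thread-123",
            text=f"echo: {inbound.text}",
        )
    )


asyncio.run(main())
```

In the real system the callback is `Channel._on_outbound` and the consumer is `ChannelManager._dispatch_loop`; the queue decouples slow platform I/O from agent runs.
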
**Message Flow**:
1. External platform → Channel impl → `MessageBus.publish_inbound()`
2. `ChannelManager._dispatch_loop()` consumes from queue
3. For chat: look up/create thread on LangGraph Server → `runs.wait()` → extract response → publish outbound (sketched below)
4. For commands (`/new`, `/status`, `/models`, `/memory`, `/help`): handle locally or query Gateway API
5. Outbound → channel callbacks → platform reply
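
Step 3 reduces to a couple of `langgraph-sdk` calls. This condensed sketch mirrors what `manager.py` does (the `lead_agent` assistant id and the localhost URL are the defaults defined there); it is illustrative, not a drop-in replacement for the manager:

```python
from langgraph_sdk import get_client


async def ask_agent(text: str, thread_id: str | None = None) -> tuple[str, str]:
    """Create or reuse a server-side thread, run the agent, and return (thread_id, reply)."""
    client = get_client(url="http://localhost:2024")

    if thread_id is None:
        thread = await client.threads.create()
        thread_id = thread["thread_id"]

    # Blocks until the run completes and returns the final state.
    result = await client.runs.wait(
        thread_id,
        "lead_agent",
        input={"messages": [{"role": "human", "content": text}]},
    )

    # The final state carries a "messages" list; take the last AI text message.
    messages = result.get("messages", []) if isinstance(result, dict) else result
    reply = ""
    for msg in reversed(messages):
        if isinstance(msg, dict) and msg.get("type") == "ai" and isinstance(msg.get("content"), str):
            reply = msg["content"]
            break
    return thread_id, reply
```
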
**Configuration** (`config.yaml` → `channels`):
- `langgraph_url` - LangGraph Server URL (default: `http://localhost:2024`)
- `gateway_url` - Gateway API URL for auxiliary commands (default: `http://localhost:8001`)
- Per-channel configs: `feishu` (app_id, app_secret), `slack` (bot_token, app_token), `telegram` (bot_token)

### Memory System (`src/agents/memory/`)

**Components**:
@@ -20,6 +20,13 @@
- [ ] Add metrics and monitoring
- [ ] Support for more document formats in upload
- [ ] Skill marketplace / remote skill installation
- [ ] Optimize async concurrency in agent hot path (IM channels multi-task scenario; replacement patterns are sketched after this list)
  - Replace `time.sleep(5)` with `asyncio.sleep()` in `src/tools/builtins/task_tool.py` (subagent polling)
  - Replace `subprocess.run()` with `asyncio.create_subprocess_shell()` in `src/sandbox/local/local_sandbox.py`
  - Replace sync `requests` with `httpx.AsyncClient` in community tools (tavily, jina_ai, firecrawl, infoquest, image_search)
  - Replace sync `model.invoke()` with async `model.ainvoke()` in title_middleware and memory updater
  - Consider `asyncio.to_thread()` wrapper for remaining blocking file I/O
  - For production: use `langgraph up` (multi-worker) instead of `langgraph dev` (single-worker)
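
To make the intended direction concrete, here is an illustrative before/after for two of the items above (not code from this commit; the call sites are the ones named in the list):

```python
import asyncio

import httpx


# Before (blocks the event loop while polling a subagent):
#     time.sleep(5)
# After — yields control so other channel messages keep flowing:
async def poll_subagent() -> None:
    await asyncio.sleep(5)


# Before (sync HTTP in a community tool):
#     resp = requests.get(url, timeout=10)
# After — async client with the same timeout semantics:
async def fetch(url: str) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.get(url, timeout=10)
        resp.raise_for_status()
        return resp.json()
```
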
## Resolved Issues
@@ -34,6 +34,11 @@ dependencies = [
    "duckdb>=1.4.4",
    "langchain-google-genai>=4.2.1",
    "langgraph-checkpoint-sqlite>=3.0.3",
    "lark-oapi>=1.4.0",
    "slack-sdk>=3.33.0",
    "python-telegram-bot>=21.0",
    "langgraph-sdk>=0.1.51",
    "markdown-to-mrkdwn>=0.3.1",
]

[dependency-groups]
@@ -257,9 +257,7 @@ def format_conversation_for_update(messages: list[Any]) -> str:
        # ephemeral file path info into long-term memory. Skip the turn entirely
        # when nothing remains after stripping (upload-only message).
        if role == "human":
            content = re.sub(
                r"<uploaded_files>[\s\S]*?</uploaded_files>\n*", "", str(content)
            ).strip()
            content = re.sub(r"<uploaded_files>[\s\S]*?</uploaded_files>\n*", "", str(content)).strip()
            if not content:
                continue

@@ -168,11 +168,7 @@ def _strip_upload_mentions_from_memory(memory_data: dict[str, Any]) -> dict[str,
    # Also remove any facts that describe upload events
    facts = memory_data.get("facts", [])
    if facts:
        memory_data["facts"] = [
            f
            for f in facts
            if not _UPLOAD_SENTENCE_RE.search(f.get("content", ""))
        ]
        memory_data["facts"] = [f for f in facts if not _UPLOAD_SENTENCE_RE.search(f.get("content", ""))]

    return memory_data

@@ -40,9 +40,7 @@ def _filter_messages_for_memory(messages: list[Any]) -> list[Any]:
    Returns:
        Filtered list containing only user inputs and final assistant responses.
    """
    _UPLOAD_BLOCK_RE = re.compile(
        r"<uploaded_files>[\s\S]*?</uploaded_files>\n*", re.IGNORECASE
    )
    _UPLOAD_BLOCK_RE = re.compile(r"<uploaded_files>[\s\S]*?</uploaded_files>\n*", re.IGNORECASE)

    filtered = []
    skip_next_ai = False

@@ -52,9 +50,7 @@ def _filter_messages_for_memory(messages: list[Any]) -> list[Any]:
        if msg_type == "human":
            content = getattr(msg, "content", "")
            if isinstance(content, list):
                content = " ".join(
                    p.get("text", "") for p in content if isinstance(p, dict)
                )
                content = " ".join(p.get("text", "") for p in content if isinstance(p, dict))
            content_str = str(content)
            if "<uploaded_files>" in content_str:
                # Strip the ephemeral upload block; keep the user's real question.
16
backend/src/channels/__init__.py
Normal file
@@ -0,0 +1,16 @@
"""IM Channel integration for DeerFlow.

Provides a pluggable channel system that connects external messaging platforms
(Feishu/Lark, Slack, Telegram) to the DeerFlow agent via the ChannelManager,
which uses ``langgraph-sdk`` to communicate with the underlying LangGraph Server.
"""

from src.channels.base import Channel
from src.channels.message_bus import InboundMessage, MessageBus, OutboundMessage

__all__ = [
    "Channel",
    "InboundMessage",
    "MessageBus",
    "OutboundMessage",
]
88
backend/src/channels/base.py
Normal file
@@ -0,0 +1,88 @@
"""Abstract base class for IM channels."""

from __future__ import annotations

import logging
from abc import ABC, abstractmethod
from typing import Any

from src.channels.message_bus import InboundMessage, InboundMessageType, MessageBus, OutboundMessage

logger = logging.getLogger(__name__)


class Channel(ABC):
    """Base class for all IM channel implementations.

    Each channel connects to an external messaging platform and:
    1. Receives messages, wraps them as InboundMessage, publishes to the bus.
    2. Subscribes to outbound messages and sends replies back to the platform.

    Subclasses must implement ``start``, ``stop``, and ``send``.
    """

    def __init__(self, name: str, bus: MessageBus, config: dict[str, Any]) -> None:
        self.name = name
        self.bus = bus
        self.config = config
        self._running = False

    @property
    def is_running(self) -> bool:
        return self._running

    # -- lifecycle ---------------------------------------------------------

    @abstractmethod
    async def start(self) -> None:
        """Start listening for messages from the external platform."""

    @abstractmethod
    async def stop(self) -> None:
        """Gracefully stop the channel."""

    # -- outbound ----------------------------------------------------------

    @abstractmethod
    async def send(self, msg: OutboundMessage) -> None:
        """Send a message back to the external platform.

        The implementation should use ``msg.chat_id`` and ``msg.thread_ts``
        to route the reply to the correct conversation/thread.
        """

    # -- helpers -----------------------------------------------------------

    def _make_inbound(
        self,
        chat_id: str,
        user_id: str,
        text: str,
        *,
        msg_type: InboundMessageType = InboundMessageType.CHAT,
        thread_ts: str | None = None,
        files: list[dict[str, Any]] | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> InboundMessage:
        """Convenience factory for creating InboundMessage instances."""
        return InboundMessage(
            channel_name=self.name,
            chat_id=chat_id,
            user_id=user_id,
            text=text,
            msg_type=msg_type,
            thread_ts=thread_ts,
            files=files or [],
            metadata=metadata or {},
        )

    async def _on_outbound(self, msg: OutboundMessage) -> None:
        """Outbound callback registered with the bus.

        Only forwards messages targeted at this channel.
        """
        if msg.channel_name == self.name:
            try:
                await self.send(msg)
            except Exception:
                logger.exception("Failed to send outbound message on channel %s", self.name)
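
An aside, not part of this diff: the contract above is small enough that a toy subclass fits in a few lines. The hypothetical `EchoChannel` below exercises only the base-class helpers shown here, so every real channel in this commit follows the same shape:

```python
from typing import Any

from src.channels.base import Channel
from src.channels.message_bus import MessageBus, OutboundMessage


class EchoChannel(Channel):
    """Toy channel: pretends a started instance received one message."""

    def __init__(self, bus: MessageBus, config: dict[str, Any]) -> None:
        super().__init__(name="echo", bus=bus, config=config)

    async def start(self) -> None:
        self._running = True
        self.bus.subscribe_outbound(self._on_outbound)
        # A real channel would open a long-lived connection here; we just
        # publish one fake inbound message built with the base-class helper.
        await self.bus.publish_inbound(
            self._make_inbound(chat_id="local", user_id="tester", text="ping")
        )

    async def stop(self) -> None:
        self._running = False
        self.bus.unsubscribe_outbound(self._on_outbound)

    async def send(self, msg: OutboundMessage) -> None:
        # Replies from the dispatcher arrive here via _on_outbound.
        print(f"echo reply -> {msg.chat_id}: {msg.text}")
```
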
301
backend/src/channels/feishu.py
Normal file
@@ -0,0 +1,301 @@
|
||||
"""Feishu/Lark channel — connects to Feishu via WebSocket (no public IP needed)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
import logging
|
||||
import threading
|
||||
from typing import Any
|
||||
|
||||
from src.channels.base import Channel
|
||||
from src.channels.message_bus import InboundMessageType, MessageBus, OutboundMessage
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class FeishuChannel(Channel):
|
||||
"""Feishu/Lark IM channel using the ``lark-oapi`` WebSocket client.
|
||||
|
||||
Configuration keys (in ``config.yaml`` under ``channels.feishu``):
|
||||
- ``app_id``: Feishu app ID.
|
||||
- ``app_secret``: Feishu app secret.
|
||||
- ``verification_token``: (optional) Event verification token.
|
||||
|
||||
The channel uses WebSocket long-connection mode so no public IP is required.
|
||||
|
||||
Message flow:
|
||||
1. User sends a message → bot adds "OK" emoji reaction
|
||||
2. Bot replies in thread: "Working on it......"
|
||||
3. Agent processes the message and returns a result
|
||||
4. Bot replies in thread with the result
|
||||
5. Bot adds "DONE" emoji reaction to the original message
|
||||
"""
|
||||
|
||||
def __init__(self, bus: MessageBus, config: dict[str, Any]) -> None:
|
||||
super().__init__(name="feishu", bus=bus, config=config)
|
||||
self._thread: threading.Thread | None = None
|
||||
self._main_loop: asyncio.AbstractEventLoop | None = None
|
||||
self._api_client = None
|
||||
self._CreateMessageReactionRequest = None
|
||||
self._CreateMessageReactionRequestBody = None
|
||||
self._Emoji = None
|
||||
|
||||
async def start(self) -> None:
|
||||
if self._running:
|
||||
return
|
||||
|
||||
try:
|
||||
import lark_oapi as lark
|
||||
from lark_oapi.api.im.v1 import (
|
||||
CreateMessageReactionRequest,
|
||||
CreateMessageReactionRequestBody,
|
||||
CreateMessageRequest,
|
||||
CreateMessageRequestBody,
|
||||
Emoji,
|
||||
ReplyMessageRequest,
|
||||
ReplyMessageRequestBody,
|
||||
)
|
||||
except ImportError:
|
||||
logger.error("lark-oapi is not installed. Install it with: uv add lark-oapi")
|
||||
return
|
||||
|
||||
self._lark = lark
|
||||
self._CreateMessageRequest = CreateMessageRequest
|
||||
self._CreateMessageRequestBody = CreateMessageRequestBody
|
||||
self._ReplyMessageRequest = ReplyMessageRequest
|
||||
self._ReplyMessageRequestBody = ReplyMessageRequestBody
|
||||
self._CreateMessageReactionRequest = CreateMessageReactionRequest
|
||||
self._CreateMessageReactionRequestBody = CreateMessageReactionRequestBody
|
||||
self._Emoji = Emoji
|
||||
|
||||
app_id = self.config.get("app_id", "")
|
||||
app_secret = self.config.get("app_secret", "")
|
||||
|
||||
if not app_id or not app_secret:
|
||||
logger.error("Feishu channel requires app_id and app_secret")
|
||||
return
|
||||
|
||||
self._api_client = lark.Client.builder().app_id(app_id).app_secret(app_secret).build()
|
||||
self._main_loop = asyncio.get_event_loop()
|
||||
|
||||
self._running = True
|
||||
self.bus.subscribe_outbound(self._on_outbound)
|
||||
|
||||
# Both ws.Client construction and start() must happen in a dedicated
|
||||
# thread with its own event loop. lark-oapi caches the running loop
|
||||
# at construction time and later calls loop.run_until_complete(),
|
||||
# which conflicts with an already-running uvloop.
|
||||
self._thread = threading.Thread(
|
||||
target=self._run_ws,
|
||||
args=(app_id, app_secret),
|
||||
daemon=True,
|
||||
)
|
||||
self._thread.start()
|
||||
logger.info("Feishu channel started")
|
||||
|
||||
def _run_ws(self, app_id: str, app_secret: str) -> None:
|
||||
"""Construct and run the lark WS client in a thread with a fresh event loop.
|
||||
|
||||
The lark-oapi SDK captures a module-level event loop at import time
|
||||
(``lark_oapi.ws.client.loop``). When uvicorn uses uvloop, that
|
||||
captured loop is the *main* thread's uvloop — which is already
|
||||
running, so ``loop.run_until_complete()`` inside ``Client.start()``
|
||||
raises ``RuntimeError``.
|
||||
|
||||
We work around this by creating a plain asyncio event loop for this
|
||||
thread and patching the SDK's module-level reference before calling
|
||||
``start()``.
|
||||
"""
|
||||
loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(loop)
|
||||
try:
|
||||
import lark_oapi as lark
|
||||
import lark_oapi.ws.client as _ws_client_mod
|
||||
|
||||
# Replace the SDK's module-level loop so Client.start() uses
|
||||
# this thread's (non-running) event loop instead of the main
|
||||
# thread's uvloop.
|
||||
_ws_client_mod.loop = loop
|
||||
|
||||
event_handler = lark.EventDispatcherHandler.builder("", "").register_p2_im_message_receive_v1(self._on_message).build()
|
||||
ws_client = lark.ws.Client(
|
||||
app_id=app_id,
|
||||
app_secret=app_secret,
|
||||
event_handler=event_handler,
|
||||
log_level=lark.LogLevel.INFO,
|
||||
)
|
||||
ws_client.start()
|
||||
except Exception:
|
||||
if self._running:
|
||||
logger.exception("Feishu WebSocket error")
|
||||
|
||||
async def stop(self) -> None:
|
||||
self._running = False
|
||||
self.bus.unsubscribe_outbound(self._on_outbound)
|
||||
if self._thread:
|
||||
self._thread.join(timeout=5)
|
||||
self._thread = None
|
||||
logger.info("Feishu channel stopped")
|
||||
|
||||
async def send(self, msg: OutboundMessage, *, _max_retries: int = 3) -> None:
|
||||
if not self._api_client:
|
||||
logger.warning("[Feishu] send called but no api_client available")
|
||||
return
|
||||
|
||||
logger.info(
|
||||
"[Feishu] sending reply: chat_id=%s, thread_ts=%s, text_len=%d",
|
||||
msg.chat_id,
|
||||
msg.thread_ts,
|
||||
len(msg.text),
|
||||
)
|
||||
content = self._build_card_content(msg.text)
|
||||
|
||||
last_exc: Exception | None = None
|
||||
for attempt in range(_max_retries):
|
||||
try:
|
||||
if msg.thread_ts:
|
||||
# Reply in thread (话题)
|
||||
request = self._ReplyMessageRequest.builder().message_id(msg.thread_ts).request_body(self._ReplyMessageRequestBody.builder().msg_type("interactive").content(content).reply_in_thread(True).build()).build()
|
||||
await asyncio.to_thread(self._api_client.im.v1.message.reply, request)
|
||||
else:
|
||||
# Send new message
|
||||
request = self._CreateMessageRequest.builder().receive_id_type("chat_id").request_body(self._CreateMessageRequestBody.builder().receive_id(msg.chat_id).msg_type("interactive").content(content).build()).build()
|
||||
await asyncio.to_thread(self._api_client.im.v1.message.create, request)
|
||||
|
||||
# Add "DONE" reaction to the original message on final reply
|
||||
if msg.is_final and msg.thread_ts:
|
||||
await self._add_reaction(msg.thread_ts, "DONE")
|
||||
|
||||
return # success
|
||||
except Exception as exc:
|
||||
last_exc = exc
|
||||
if attempt < _max_retries - 1:
|
||||
delay = 2**attempt # 1s, 2s
|
||||
logger.warning(
|
||||
"[Feishu] send failed (attempt %d/%d), retrying in %ds: %s",
|
||||
attempt + 1,
|
||||
_max_retries,
|
||||
delay,
|
||||
exc,
|
||||
)
|
||||
await asyncio.sleep(delay)
|
||||
|
||||
logger.error("[Feishu] send failed after %d attempts: %s", _max_retries, last_exc)
|
||||
raise last_exc # type: ignore[misc]
|
||||
|
||||
# -- message formatting ------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def _build_card_content(text: str) -> str:
|
||||
"""Build a Feishu interactive card with markdown content.
|
||||
|
||||
Feishu's interactive card format natively renders markdown, including
|
||||
headers, bold/italic, code blocks, lists, and links.
|
||||
"""
|
||||
card = {
|
||||
"config": {"wide_screen_mode": True},
|
||||
"elements": [{"tag": "markdown", "content": text}],
|
||||
}
|
||||
return json.dumps(card)
|
||||
|
||||
# -- reaction helpers --------------------------------------------------
|
||||
|
||||
async def _add_reaction(self, message_id: str, emoji_type: str = "THUMBSUP") -> None:
|
||||
"""Add an emoji reaction to a message."""
|
||||
if not self._api_client or not self._CreateMessageReactionRequest:
|
||||
return
|
||||
try:
|
||||
request = self._CreateMessageReactionRequest.builder().message_id(message_id).request_body(self._CreateMessageReactionRequestBody.builder().reaction_type(self._Emoji.builder().emoji_type(emoji_type).build()).build()).build()
|
||||
await asyncio.to_thread(self._api_client.im.v1.message_reaction.create, request)
|
||||
logger.info("[Feishu] reaction '%s' added to message %s", emoji_type, message_id)
|
||||
except Exception:
|
||||
logger.exception("[Feishu] failed to add reaction '%s' to message %s", emoji_type, message_id)
|
||||
|
||||
async def _send_running_reply(self, message_id: str) -> None:
|
||||
"""Reply to a message in-thread with a 'Working on it...' hint."""
|
||||
if not self._api_client:
|
||||
return
|
||||
try:
|
||||
content = self._build_card_content("Working on it...")
|
||||
request = self._ReplyMessageRequest.builder().message_id(message_id).request_body(self._ReplyMessageRequestBody.builder().msg_type("interactive").content(content).reply_in_thread(True).build()).build()
|
||||
await asyncio.to_thread(self._api_client.im.v1.message.reply, request)
|
||||
logger.info("[Feishu] 'Working on it......' reply sent for message %s", message_id)
|
||||
except Exception:
|
||||
logger.exception("[Feishu] failed to send running reply for message %s", message_id)
|
||||
|
||||
# -- internal ----------------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def _log_future_error(fut, name: str, msg_id: str) -> None:
|
||||
"""Callback for run_coroutine_threadsafe futures to surface errors."""
|
||||
try:
|
||||
exc = fut.exception()
|
||||
if exc:
|
||||
logger.error("[Feishu] %s failed for msg_id=%s: %s", name, msg_id, exc)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
def _on_message(self, event) -> None:
|
||||
"""Called by lark-oapi when a message is received (runs in lark thread)."""
|
||||
try:
|
||||
logger.info("[Feishu] raw event received: type=%s", type(event).__name__)
|
||||
message = event.event.message
|
||||
chat_id = message.chat_id
|
||||
msg_id = message.message_id
|
||||
sender_id = event.event.sender.sender_id.open_id
|
||||
|
||||
# root_id is set when the message is a reply within a Feishu thread.
|
||||
# Use it as topic_id so all replies share the same DeerFlow thread.
|
||||
root_id = getattr(message, "root_id", None) or None
|
||||
|
||||
# Parse message content
|
||||
content = json.loads(message.content)
|
||||
text = content.get("text", "").strip()
|
||||
logger.info(
|
||||
"[Feishu] parsed message: chat_id=%s, msg_id=%s, root_id=%s, sender=%s, text=%r",
|
||||
chat_id,
|
||||
msg_id,
|
||||
root_id,
|
||||
sender_id,
|
||||
text[:100] if text else "",
|
||||
)
|
||||
|
||||
if not text:
|
||||
logger.info("[Feishu] empty text, ignoring message")
|
||||
return
|
||||
|
||||
# Check if it's a command
|
||||
if text.startswith("/"):
|
||||
msg_type = InboundMessageType.COMMAND
|
||||
else:
|
||||
msg_type = InboundMessageType.CHAT
|
||||
|
||||
# topic_id: use root_id for replies (same topic), msg_id for new messages (new topic)
|
||||
topic_id = root_id or msg_id
|
||||
|
||||
inbound = self._make_inbound(
|
||||
chat_id=chat_id,
|
||||
user_id=sender_id,
|
||||
text=text,
|
||||
msg_type=msg_type,
|
||||
thread_ts=msg_id,
|
||||
metadata={"message_id": msg_id, "root_id": root_id},
|
||||
)
|
||||
inbound.topic_id = topic_id
|
||||
|
||||
# Schedule on the async event loop
|
||||
if self._main_loop and self._main_loop.is_running():
|
||||
logger.info("[Feishu] publishing inbound message to bus (type=%s, msg_id=%s)", msg_type.value, msg_id)
|
||||
# Schedule all coroutines and attach error logging to futures
|
||||
for name, coro in [
|
||||
("add_reaction", self._add_reaction(msg_id, "OK")),
|
||||
("send_running_reply", self._send_running_reply(msg_id)),
|
||||
("publish_inbound", self.bus.publish_inbound(inbound)),
|
||||
]:
|
||||
fut = asyncio.run_coroutine_threadsafe(coro, self._main_loop)
|
||||
fut.add_done_callback(lambda f, n=name, mid=msg_id: self._log_future_error(f, n, mid))
|
||||
else:
|
||||
logger.warning("[Feishu] main loop not running, cannot publish inbound message")
|
||||
except Exception:
|
||||
logger.exception("[Feishu] error processing message")
|
||||
367
backend/src/channels/manager.py
Normal file
@@ -0,0 +1,367 @@
|
||||
"""ChannelManager — consumes inbound messages and dispatches them to the DeerFlow agent via LangGraph Server."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
|
||||
from src.channels.message_bus import InboundMessage, InboundMessageType, MessageBus, OutboundMessage
|
||||
from src.channels.store import ChannelStore
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
DEFAULT_LANGGRAPH_URL = "http://localhost:2024"
|
||||
DEFAULT_GATEWAY_URL = "http://localhost:8001"
|
||||
DEFAULT_ASSISTANT_ID = "lead_agent"
|
||||
|
||||
|
||||
def _extract_response_text(result: dict | list) -> str:
|
||||
"""Extract the last AI message text from a LangGraph runs.wait result.
|
||||
|
||||
``runs.wait`` returns the final state dict which contains a ``messages``
|
||||
list. Each message is a dict with at least ``type`` and ``content``.
|
||||
|
||||
Handles special cases:
|
||||
- Regular AI text responses
|
||||
- Clarification interrupts (``ask_clarification`` tool messages)
|
||||
- AI messages with tool_calls but no text content
|
||||
"""
|
||||
if isinstance(result, list):
|
||||
messages = result
|
||||
elif isinstance(result, dict):
|
||||
messages = result.get("messages", [])
|
||||
else:
|
||||
return ""
|
||||
|
||||
# Walk backwards to find usable response text
|
||||
for msg in reversed(messages):
|
||||
if not isinstance(msg, dict):
|
||||
continue
|
||||
|
||||
msg_type = msg.get("type")
|
||||
|
||||
# Check for tool messages from ask_clarification (interrupt case)
|
||||
if msg_type == "tool" and msg.get("name") == "ask_clarification":
|
||||
content = msg.get("content", "")
|
||||
if isinstance(content, str) and content:
|
||||
return content
|
||||
|
||||
# Regular AI message with text content
|
||||
if msg_type == "ai":
|
||||
content = msg.get("content", "")
|
||||
if isinstance(content, str) and content:
|
||||
return content
|
||||
# content can be a list of content blocks
|
||||
if isinstance(content, list):
|
||||
parts = []
|
||||
for block in content:
|
||||
if isinstance(block, dict) and block.get("type") == "text":
|
||||
parts.append(block.get("text", ""))
|
||||
elif isinstance(block, str):
|
||||
parts.append(block)
|
||||
text = "".join(parts)
|
||||
if text:
|
||||
return text
|
||||
return ""
|
||||
|
||||
|
||||
def _extract_artifacts(result: dict | list) -> list[str]:
|
||||
"""Extract artifact paths from the last AI response cycle only.
|
||||
|
||||
Instead of reading the full accumulated ``artifacts`` state (which contains
|
||||
all artifacts ever produced in the thread), this inspects the messages after
|
||||
the last human message and collects file paths from ``present_files`` tool
|
||||
calls. This ensures only newly-produced artifacts are returned.
|
||||
"""
|
||||
if isinstance(result, list):
|
||||
messages = result
|
||||
elif isinstance(result, dict):
|
||||
messages = result.get("messages", [])
|
||||
else:
|
||||
return []
|
||||
|
||||
artifacts: list[str] = []
|
||||
for msg in reversed(messages):
|
||||
if not isinstance(msg, dict):
|
||||
continue
|
||||
# Stop at the last human message — anything before it is a previous turn
|
||||
if msg.get("type") == "human":
|
||||
break
|
||||
# Look for AI messages with present_files tool calls
|
||||
if msg.get("type") == "ai":
|
||||
for tc in msg.get("tool_calls", []):
|
||||
if isinstance(tc, dict) and tc.get("name") == "present_files":
|
||||
args = tc.get("args", {})
|
||||
paths = args.get("filepaths", [])
|
||||
if isinstance(paths, list):
|
||||
artifacts.extend(p for p in paths if isinstance(p, str))
|
||||
return artifacts
|
||||
|
||||
|
||||
def _format_artifact_text(artifacts: list[str]) -> str:
|
||||
"""Format artifact paths into a human-readable text block listing filenames."""
|
||||
import posixpath
|
||||
|
||||
filenames = [posixpath.basename(p) for p in artifacts]
|
||||
if len(filenames) == 1:
|
||||
return f"Created File: 📎 {filenames[0]}"
|
||||
return "Created Files: 📎 " + "、".join(filenames)
|
||||
|
||||
|
||||
class ChannelManager:
|
||||
"""Core dispatcher that bridges IM channels to the DeerFlow agent.
|
||||
|
||||
It reads from the MessageBus inbound queue, creates/reuses threads on
|
||||
the LangGraph Server, sends messages via ``runs.wait``, and publishes
|
||||
outbound responses back through the bus.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
bus: MessageBus,
|
||||
store: ChannelStore,
|
||||
*,
|
||||
max_concurrency: int = 5,
|
||||
langgraph_url: str = DEFAULT_LANGGRAPH_URL,
|
||||
gateway_url: str = DEFAULT_GATEWAY_URL,
|
||||
assistant_id: str = DEFAULT_ASSISTANT_ID,
|
||||
) -> None:
|
||||
self.bus = bus
|
||||
self.store = store
|
||||
self._max_concurrency = max_concurrency
|
||||
self._langgraph_url = langgraph_url
|
||||
self._gateway_url = gateway_url
|
||||
self._assistant_id = assistant_id
|
||||
self._client = None # lazy init — langgraph_sdk async client
|
||||
self._semaphore: asyncio.Semaphore | None = None
|
||||
self._running = False
|
||||
self._task: asyncio.Task | None = None
|
||||
|
||||
# -- LangGraph SDK client (lazy) ----------------------------------------
|
||||
|
||||
def _get_client(self):
|
||||
"""Return the ``langgraph_sdk`` async client, creating it on first use."""
|
||||
if self._client is None:
|
||||
from langgraph_sdk import get_client
|
||||
|
||||
self._client = get_client(url=self._langgraph_url)
|
||||
return self._client
|
||||
|
||||
# -- lifecycle ---------------------------------------------------------
|
||||
|
||||
async def start(self) -> None:
|
||||
"""Start the dispatch loop."""
|
||||
if self._running:
|
||||
return
|
||||
self._running = True
|
||||
self._semaphore = asyncio.Semaphore(self._max_concurrency)
|
||||
self._task = asyncio.create_task(self._dispatch_loop())
|
||||
logger.info("ChannelManager started (max_concurrency=%d)", self._max_concurrency)
|
||||
|
||||
async def stop(self) -> None:
|
||||
"""Stop the dispatch loop."""
|
||||
self._running = False
|
||||
if self._task:
|
||||
self._task.cancel()
|
||||
try:
|
||||
await self._task
|
||||
except asyncio.CancelledError:
|
||||
pass
|
||||
self._task = None
|
||||
logger.info("ChannelManager stopped")
|
||||
|
||||
# -- dispatch loop -----------------------------------------------------
|
||||
|
||||
async def _dispatch_loop(self) -> None:
|
||||
logger.info("[Manager] dispatch loop started, waiting for inbound messages")
|
||||
while self._running:
|
||||
try:
|
||||
msg = await asyncio.wait_for(self.bus.get_inbound(), timeout=1.0)
|
||||
except TimeoutError:
|
||||
continue
|
||||
except asyncio.CancelledError:
|
||||
break
|
||||
|
||||
logger.info(
|
||||
"[Manager] received inbound: channel=%s, chat_id=%s, type=%s, text=%r",
|
||||
msg.channel_name,
|
||||
msg.chat_id,
|
||||
msg.msg_type.value,
|
||||
msg.text[:100] if msg.text else "",
|
||||
)
|
||||
task = asyncio.create_task(self._handle_message(msg))
|
||||
task.add_done_callback(self._log_task_error)
|
||||
|
||||
@staticmethod
|
||||
def _log_task_error(task: asyncio.Task) -> None:
|
||||
"""Surface unhandled exceptions from background tasks."""
|
||||
if task.cancelled():
|
||||
return
|
||||
exc = task.exception()
|
||||
if exc:
|
||||
logger.error("[Manager] unhandled error in message task: %s", exc, exc_info=exc)
|
||||
|
||||
async def _handle_message(self, msg: InboundMessage) -> None:
|
||||
async with self._semaphore:
|
||||
try:
|
||||
if msg.msg_type == InboundMessageType.COMMAND:
|
||||
await self._handle_command(msg)
|
||||
else:
|
||||
await self._handle_chat(msg)
|
||||
except Exception:
|
||||
logger.exception(
|
||||
"Error handling message from %s (chat=%s)",
|
||||
msg.channel_name,
|
||||
msg.chat_id,
|
||||
)
|
||||
await self._send_error(msg, "An internal error occurred. Please try again.")
|
||||
|
||||
# -- chat handling -----------------------------------------------------
|
||||
|
||||
async def _create_thread(self, client, msg: InboundMessage) -> str:
|
||||
"""Create a new thread on the LangGraph Server and store the mapping."""
|
||||
thread = await client.threads.create()
|
||||
thread_id = thread["thread_id"]
|
||||
self.store.set_thread_id(
|
||||
msg.channel_name,
|
||||
msg.chat_id,
|
||||
thread_id,
|
||||
topic_id=msg.topic_id,
|
||||
user_id=msg.user_id,
|
||||
)
|
||||
logger.info("[Manager] new thread created on LangGraph Server: thread_id=%s for chat_id=%s topic_id=%s", thread_id, msg.chat_id, msg.topic_id)
|
||||
return thread_id
|
||||
|
||||
async def _handle_chat(self, msg: InboundMessage) -> None:
|
||||
client = self._get_client()
|
||||
|
||||
# Look up existing DeerFlow thread by topic_id (if present)
|
||||
thread_id = None
|
||||
if msg.topic_id:
|
||||
thread_id = self.store.get_thread_id(msg.channel_name, msg.chat_id, topic_id=msg.topic_id)
|
||||
if thread_id:
|
||||
logger.info("[Manager] reusing thread: thread_id=%s for topic_id=%s", thread_id, msg.topic_id)
|
||||
|
||||
# No existing thread found — create a new one
|
||||
if thread_id is None:
|
||||
thread_id = await self._create_thread(client, msg)
|
||||
|
||||
logger.info("[Manager] invoking runs.wait(thread_id=%s, text=%r)", thread_id, msg.text[:100])
|
||||
result = await client.runs.wait(
|
||||
thread_id,
|
||||
self._assistant_id,
|
||||
input={"messages": [{"role": "human", "content": msg.text}]},
|
||||
config={"recursion_limit": 100},
|
||||
context={
|
||||
"thread_id": thread_id,
|
||||
"thinking_enabled": True,
|
||||
"is_plan_mode": False,
|
||||
"subagent_enabled": False,
|
||||
},
|
||||
)
|
||||
|
||||
response_text = _extract_response_text(result)
|
||||
artifacts = _extract_artifacts(result)
|
||||
|
||||
logger.info(
|
||||
"[Manager] agent response received: thread_id=%s, response_len=%d, artifacts=%d",
|
||||
thread_id,
|
||||
len(response_text) if response_text else 0,
|
||||
len(artifacts),
|
||||
)
|
||||
|
||||
# Append artifact filenames when present
|
||||
if artifacts:
|
||||
artifact_text = _format_artifact_text(artifacts)
|
||||
if response_text:
|
||||
response_text = response_text + "\n\n" + artifact_text
|
||||
else:
|
||||
response_text = artifact_text
|
||||
|
||||
if not response_text:
|
||||
response_text = "(No response from agent)"
|
||||
|
||||
outbound = OutboundMessage(
|
||||
channel_name=msg.channel_name,
|
||||
chat_id=msg.chat_id,
|
||||
thread_id=thread_id,
|
||||
text=response_text,
|
||||
artifacts=artifacts,
|
||||
thread_ts=msg.thread_ts,
|
||||
)
|
||||
logger.info("[Manager] publishing outbound message to bus: channel=%s, chat_id=%s", msg.channel_name, msg.chat_id)
|
||||
await self.bus.publish_outbound(outbound)
|
||||
|
||||
# -- command handling --------------------------------------------------
|
||||
|
||||
async def _handle_command(self, msg: InboundMessage) -> None:
|
||||
text = msg.text.strip()
|
||||
parts = text.split(maxsplit=1)
|
||||
command = parts[0].lower().lstrip("/")
|
||||
|
||||
if command == "new":
|
||||
# Create a new thread on the LangGraph Server
|
||||
client = self._get_client()
|
||||
thread = await client.threads.create()
|
||||
new_thread_id = thread["thread_id"]
|
||||
self.store.set_thread_id(
|
||||
msg.channel_name,
|
||||
msg.chat_id,
|
||||
new_thread_id,
|
||||
topic_id=msg.topic_id,
|
||||
user_id=msg.user_id,
|
||||
)
|
||||
reply = "New conversation started."
|
||||
elif command == "status":
|
||||
thread_id = self.store.get_thread_id(msg.channel_name, msg.chat_id, topic_id=msg.topic_id)
|
||||
reply = f"Active thread: {thread_id}" if thread_id else "No active conversation."
|
||||
elif command == "models":
|
||||
reply = await self._fetch_gateway("/api/models", "models")
|
||||
elif command == "memory":
|
||||
reply = await self._fetch_gateway("/api/memory", "memory")
|
||||
elif command == "help":
|
||||
reply = "Available commands:\n/new — Start a new conversation\n/status — Show current thread info\n/models — List available models\n/memory — Show memory status\n/help — Show this help"
|
||||
else:
|
||||
reply = f"Unknown command: /{command}. Type /help for available commands."
|
||||
|
||||
outbound = OutboundMessage(
|
||||
channel_name=msg.channel_name,
|
||||
chat_id=msg.chat_id,
|
||||
thread_id=self.store.get_thread_id(msg.channel_name, msg.chat_id) or "",
|
||||
text=reply,
|
||||
thread_ts=msg.thread_ts,
|
||||
)
|
||||
await self.bus.publish_outbound(outbound)
|
||||
|
||||
async def _fetch_gateway(self, path: str, kind: str) -> str:
|
||||
"""Fetch data from the Gateway API for command responses."""
|
||||
import httpx
|
||||
|
||||
try:
|
||||
async with httpx.AsyncClient() as http:
|
||||
resp = await http.get(f"{self._gateway_url}{path}", timeout=10)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
except Exception:
|
||||
logger.exception("Failed to fetch %s from gateway", kind)
|
||||
return f"Failed to fetch {kind} information."
|
||||
|
||||
if kind == "models":
|
||||
names = [m["name"] for m in data.get("models", [])]
|
||||
return ("Available models:\n" + "\n".join(f"• {n}" for n in names)) if names else "No models configured."
|
||||
elif kind == "memory":
|
||||
facts = data.get("facts", [])
|
||||
return f"Memory contains {len(facts)} fact(s)."
|
||||
return str(data)
|
||||
|
||||
# -- error helper ------------------------------------------------------
|
||||
|
||||
async def _send_error(self, msg: InboundMessage, error_text: str) -> None:
|
||||
outbound = OutboundMessage(
|
||||
channel_name=msg.channel_name,
|
||||
chat_id=msg.chat_id,
|
||||
thread_id=self.store.get_thread_id(msg.channel_name, msg.chat_id) or "",
|
||||
text=error_text,
|
||||
thread_ts=msg.thread_ts,
|
||||
)
|
||||
await self.bus.publish_outbound(outbound)
|
||||
150
backend/src/channels/message_bus.py
Normal file
@@ -0,0 +1,150 @@
|
||||
"""MessageBus — async pub/sub hub that decouples channels from the agent dispatcher."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
import time
|
||||
from collections.abc import Callable, Coroutine
|
||||
from dataclasses import dataclass, field
|
||||
from enum import StrEnum
|
||||
from typing import Any
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Message types
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class InboundMessageType(StrEnum):
|
||||
"""Types of messages arriving from IM channels."""
|
||||
|
||||
CHAT = "chat"
|
||||
COMMAND = "command"
|
||||
|
||||
|
||||
@dataclass
|
||||
class InboundMessage:
|
||||
"""A message arriving from an IM channel toward the agent dispatcher.
|
||||
|
||||
Attributes:
|
||||
channel_name: Name of the source channel (e.g. "feishu", "slack").
|
||||
chat_id: Platform-specific chat/conversation identifier.
|
||||
user_id: Platform-specific user identifier.
|
||||
text: The message text.
|
||||
msg_type: Whether this is a regular chat message or a command.
|
||||
thread_ts: Optional platform thread identifier (for threaded replies).
|
||||
topic_id: Conversation topic identifier used to map to a DeerFlow thread.
|
||||
Messages sharing the same ``topic_id`` within a ``chat_id`` will
|
||||
reuse the same DeerFlow thread. When ``None``, each message
|
||||
creates a new thread (one-shot Q&A).
|
||||
files: Optional list of file attachments (platform-specific dicts).
|
||||
metadata: Arbitrary extra data from the channel.
|
||||
created_at: Unix timestamp when the message was created.
|
||||
"""
|
||||
|
||||
channel_name: str
|
||||
chat_id: str
|
||||
user_id: str
|
||||
text: str
|
||||
msg_type: InboundMessageType = InboundMessageType.CHAT
|
||||
thread_ts: str | None = None
|
||||
topic_id: str | None = None
|
||||
files: list[dict[str, Any]] = field(default_factory=list)
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
created_at: float = field(default_factory=time.time)
|
||||
|
||||
|
||||
@dataclass
|
||||
class OutboundMessage:
|
||||
"""A message from the agent dispatcher back to a channel.
|
||||
|
||||
Attributes:
|
||||
channel_name: Target channel name (used for routing).
|
||||
chat_id: Target chat/conversation identifier.
|
||||
thread_id: DeerFlow thread ID that produced this response.
|
||||
text: The response text.
|
||||
artifacts: List of artifact paths produced by the agent.
|
||||
is_final: Whether this is the final message in the response stream.
|
||||
thread_ts: Optional platform thread identifier for threaded replies.
|
||||
metadata: Arbitrary extra data.
|
||||
created_at: Unix timestamp.
|
||||
"""
|
||||
|
||||
channel_name: str
|
||||
chat_id: str
|
||||
thread_id: str
|
||||
text: str
|
||||
artifacts: list[str] = field(default_factory=list)
|
||||
is_final: bool = True
|
||||
thread_ts: str | None = None
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
created_at: float = field(default_factory=time.time)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# MessageBus
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
OutboundCallback = Callable[[OutboundMessage], Coroutine[Any, Any, None]]
|
||||
|
||||
|
||||
class MessageBus:
|
||||
"""Async pub/sub hub connecting channels and the agent dispatcher.
|
||||
|
||||
Channels publish inbound messages; the dispatcher consumes them.
|
||||
The dispatcher publishes outbound messages; channels receive them
|
||||
via registered callbacks.
|
||||
"""
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._inbound_queue: asyncio.Queue[InboundMessage] = asyncio.Queue()
|
||||
self._outbound_listeners: list[OutboundCallback] = []
|
||||
|
||||
# -- inbound -----------------------------------------------------------
|
||||
|
||||
async def publish_inbound(self, msg: InboundMessage) -> None:
|
||||
"""Enqueue an inbound message from a channel."""
|
||||
await self._inbound_queue.put(msg)
|
||||
logger.info(
|
||||
"[Bus] inbound enqueued: channel=%s, chat_id=%s, type=%s, queue_size=%d",
|
||||
msg.channel_name,
|
||||
msg.chat_id,
|
||||
msg.msg_type.value,
|
||||
self._inbound_queue.qsize(),
|
||||
)
|
||||
|
||||
async def get_inbound(self) -> InboundMessage:
|
||||
"""Block until the next inbound message is available."""
|
||||
return await self._inbound_queue.get()
|
||||
|
||||
@property
|
||||
def inbound_queue(self) -> asyncio.Queue[InboundMessage]:
|
||||
return self._inbound_queue
|
||||
|
||||
# -- outbound ----------------------------------------------------------
|
||||
|
||||
def subscribe_outbound(self, callback: OutboundCallback) -> None:
|
||||
"""Register an async callback for outbound messages."""
|
||||
self._outbound_listeners.append(callback)
|
||||
|
||||
def unsubscribe_outbound(self, callback: OutboundCallback) -> None:
|
||||
"""Remove a previously registered outbound callback."""
|
||||
self._outbound_listeners = [cb for cb in self._outbound_listeners if cb is not callback]
|
||||
|
||||
async def publish_outbound(self, msg: OutboundMessage) -> None:
|
||||
"""Dispatch an outbound message to all registered listeners."""
|
||||
logger.info(
|
||||
"[Bus] outbound dispatching: channel=%s, chat_id=%s, listeners=%d, text_len=%d",
|
||||
msg.channel_name,
|
||||
msg.chat_id,
|
||||
len(self._outbound_listeners),
|
||||
len(msg.text),
|
||||
)
|
||||
for callback in self._outbound_listeners:
|
||||
try:
|
||||
await callback(msg)
|
||||
except Exception:
|
||||
logger.exception("Error in outbound callback for channel=%s", msg.channel_name)
|
||||
174
backend/src/channels/service.py
Normal file
@@ -0,0 +1,174 @@
|
||||
"""ChannelService — manages the lifecycle of all IM channels."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
from typing import Any
|
||||
|
||||
from src.channels.manager import ChannelManager
|
||||
from src.channels.message_bus import MessageBus
|
||||
from src.channels.store import ChannelStore
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Channel name → import path for lazy loading
|
||||
_CHANNEL_REGISTRY: dict[str, str] = {
|
||||
"feishu": "src.channels.feishu:FeishuChannel",
|
||||
"slack": "src.channels.slack:SlackChannel",
|
||||
"telegram": "src.channels.telegram:TelegramChannel",
|
||||
}
|
||||
|
||||
|
||||
class ChannelService:
|
||||
"""Manages the lifecycle of all configured IM channels.
|
||||
|
||||
Reads configuration from ``config.yaml`` under the ``channels`` key,
|
||||
instantiates enabled channels, and starts the ChannelManager dispatcher.
|
||||
"""
|
||||
|
||||
def __init__(self, channels_config: dict[str, Any] | None = None) -> None:
|
||||
self.bus = MessageBus()
|
||||
self.store = ChannelStore()
|
||||
config = dict(channels_config or {})
|
||||
langgraph_url = config.pop("langgraph_url", None) or "http://localhost:2024"
|
||||
gateway_url = config.pop("gateway_url", None) or "http://localhost:8001"
|
||||
self.manager = ChannelManager(
|
||||
bus=self.bus,
|
||||
store=self.store,
|
||||
langgraph_url=langgraph_url,
|
||||
gateway_url=gateway_url,
|
||||
)
|
||||
self._channels: dict[str, Any] = {} # name -> Channel instance
|
||||
self._config = config
|
||||
self._running = False
|
||||
|
||||
@classmethod
|
||||
def from_app_config(cls) -> ChannelService:
|
||||
"""Create a ChannelService from the application config."""
|
||||
from src.config.app_config import get_app_config
|
||||
|
||||
config = get_app_config()
|
||||
channels_config = {}
|
||||
# extra fields are allowed by AppConfig (extra="allow")
|
||||
extra = config.model_extra or {}
|
||||
if "channels" in extra:
|
||||
channels_config = extra["channels"]
|
||||
return cls(channels_config=channels_config)
|
||||
|
||||
async def start(self) -> None:
|
||||
"""Start the manager and all enabled channels."""
|
||||
if self._running:
|
||||
return
|
||||
|
||||
await self.manager.start()
|
||||
|
||||
for name, channel_config in self._config.items():
|
||||
if not isinstance(channel_config, dict):
|
||||
continue
|
||||
if not channel_config.get("enabled", False):
|
||||
logger.info("Channel %s is disabled, skipping", name)
|
||||
continue
|
||||
|
||||
await self._start_channel(name, channel_config)
|
||||
|
||||
self._running = True
|
||||
logger.info("ChannelService started with channels: %s", list(self._channels.keys()))
|
||||
|
||||
async def stop(self) -> None:
|
||||
"""Stop all channels and the manager."""
|
||||
for name, channel in list(self._channels.items()):
|
||||
try:
|
||||
await channel.stop()
|
||||
logger.info("Channel %s stopped", name)
|
||||
except Exception:
|
||||
logger.exception("Error stopping channel %s", name)
|
||||
self._channels.clear()
|
||||
|
||||
await self.manager.stop()
|
||||
self._running = False
|
||||
logger.info("ChannelService stopped")
|
||||
|
||||
async def restart_channel(self, name: str) -> bool:
|
||||
"""Restart a specific channel. Returns True if successful."""
|
||||
if name in self._channels:
|
||||
try:
|
||||
await self._channels[name].stop()
|
||||
except Exception:
|
||||
logger.exception("Error stopping channel %s for restart", name)
|
||||
del self._channels[name]
|
||||
|
||||
config = self._config.get(name)
|
||||
if not config or not isinstance(config, dict):
|
||||
logger.warning("No config for channel %s", name)
|
||||
return False
|
||||
|
||||
return await self._start_channel(name, config)
|
||||
|
||||
async def _start_channel(self, name: str, config: dict[str, Any]) -> bool:
|
||||
"""Instantiate and start a single channel."""
|
||||
import_path = _CHANNEL_REGISTRY.get(name)
|
||||
if not import_path:
|
||||
logger.warning("Unknown channel type: %s", name)
|
||||
return False
|
||||
|
||||
try:
|
||||
from src.reflection import resolve_class
|
||||
|
||||
channel_cls = resolve_class(import_path, base_class=None)
|
||||
except Exception:
|
||||
logger.exception("Failed to import channel class for %s", name)
|
||||
return False
|
||||
|
||||
try:
|
||||
channel = channel_cls(bus=self.bus, config=config)
|
||||
await channel.start()
|
||||
self._channels[name] = channel
|
||||
logger.info("Channel %s started", name)
|
||||
return True
|
||||
except Exception:
|
||||
logger.exception("Failed to start channel %s", name)
|
||||
return False
|
||||
|
||||
def get_status(self) -> dict[str, Any]:
|
||||
"""Return status information for all channels."""
|
||||
channels_status = {}
|
||||
for name in _CHANNEL_REGISTRY:
|
||||
config = self._config.get(name, {})
|
||||
enabled = isinstance(config, dict) and config.get("enabled", False)
|
||||
running = name in self._channels and self._channels[name].is_running
|
||||
channels_status[name] = {
|
||||
"enabled": enabled,
|
||||
"running": running,
|
||||
}
|
||||
return {
|
||||
"service_running": self._running,
|
||||
"channels": channels_status,
|
||||
}
|
||||
|
||||
|
||||
# -- singleton access -------------------------------------------------------
|
||||
|
||||
_channel_service: ChannelService | None = None
|
||||
|
||||
|
||||
def get_channel_service() -> ChannelService | None:
|
||||
"""Get the singleton ChannelService instance (if started)."""
|
||||
return _channel_service
|
||||
|
||||
|
||||
async def start_channel_service() -> ChannelService:
|
||||
"""Create and start the global ChannelService from app config."""
|
||||
global _channel_service
|
||||
if _channel_service is not None:
|
||||
return _channel_service
|
||||
_channel_service = ChannelService.from_app_config()
|
||||
await _channel_service.start()
|
||||
return _channel_service
|
||||
|
||||
|
||||
async def stop_channel_service() -> None:
|
||||
"""Stop the global ChannelService."""
|
||||
global _channel_service
|
||||
if _channel_service is not None:
|
||||
await _channel_service.stop()
|
||||
_channel_service = None
|
||||
223
backend/src/channels/slack.py
Normal file
@@ -0,0 +1,223 @@
|
||||
"""Slack channel — connects via Socket Mode (no public IP needed)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
from typing import Any
|
||||
|
||||
from markdown_to_mrkdwn import SlackMarkdownConverter
|
||||
|
||||
from src.channels.base import Channel
|
||||
from src.channels.message_bus import InboundMessageType, MessageBus, OutboundMessage
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
_slack_md_converter = SlackMarkdownConverter()
|
||||
|
||||
|
||||
class SlackChannel(Channel):
|
||||
"""Slack IM channel using Socket Mode (WebSocket, no public IP).
|
||||
|
||||
Configuration keys (in ``config.yaml`` under ``channels.slack``):
|
||||
- ``bot_token``: Slack Bot User OAuth Token (xoxb-...).
|
||||
- ``app_token``: Slack App-Level Token (xapp-...) for Socket Mode.
|
||||
- ``allowed_users``: (optional) List of allowed Slack user IDs. Empty = allow all.
|
||||
"""
|
||||
|
||||
def __init__(self, bus: MessageBus, config: dict[str, Any]) -> None:
|
||||
super().__init__(name="slack", bus=bus, config=config)
|
||||
self._socket_client = None
|
||||
self._web_client = None
|
||||
self._loop: asyncio.AbstractEventLoop | None = None
|
||||
self._allowed_users: set[str] = set(config.get("allowed_users", []))
|
||||
|
||||
async def start(self) -> None:
|
||||
if self._running:
|
||||
return
|
||||
|
||||
try:
|
||||
from slack_sdk import WebClient
|
||||
from slack_sdk.socket_mode import SocketModeClient
|
||||
from slack_sdk.socket_mode.response import SocketModeResponse
|
||||
except ImportError:
|
||||
logger.error("slack-sdk is not installed. Install it with: uv add slack-sdk")
|
||||
return
|
||||
|
||||
self._SocketModeResponse = SocketModeResponse
|
||||
|
||||
bot_token = self.config.get("bot_token", "")
|
||||
app_token = self.config.get("app_token", "")
|
||||
|
||||
if not bot_token or not app_token:
|
||||
logger.error("Slack channel requires bot_token and app_token")
|
||||
return
|
||||
|
||||
self._web_client = WebClient(token=bot_token)
|
||||
self._socket_client = SocketModeClient(
|
||||
app_token=app_token,
|
||||
web_client=self._web_client,
|
||||
)
|
||||
self._loop = asyncio.get_event_loop()
|
||||
|
||||
self._socket_client.socket_mode_request_listeners.append(self._on_socket_event)
|
||||
|
||||
self._running = True
|
||||
self.bus.subscribe_outbound(self._on_outbound)
|
||||
|
||||
# Start socket mode in background thread
|
||||
asyncio.get_event_loop().run_in_executor(None, self._socket_client.connect)
|
||||
logger.info("Slack channel started")
|
||||
|
||||
async def stop(self) -> None:
|
||||
self._running = False
|
||||
self.bus.unsubscribe_outbound(self._on_outbound)
|
||||
if self._socket_client:
|
||||
self._socket_client.close()
|
||||
self._socket_client = None
|
||||
logger.info("Slack channel stopped")
|
||||
|
||||
async def send(self, msg: OutboundMessage, *, _max_retries: int = 3) -> None:
|
||||
if not self._web_client:
|
||||
return
|
||||
|
||||
kwargs: dict[str, Any] = {
|
||||
"channel": msg.chat_id,
|
||||
"text": _slack_md_converter.convert(msg.text),
|
||||
}
|
||||
if msg.thread_ts:
|
||||
kwargs["thread_ts"] = msg.thread_ts
|
||||
|
||||
last_exc: Exception | None = None
|
||||
for attempt in range(_max_retries):
|
||||
try:
|
||||
await asyncio.to_thread(self._web_client.chat_postMessage, **kwargs)
|
||||
# Add a completion reaction to the thread root
|
||||
if msg.thread_ts:
|
||||
await asyncio.to_thread(
|
||||
self._add_reaction,
|
||||
msg.chat_id,
|
||||
msg.thread_ts,
|
||||
"white_check_mark",
|
||||
)
|
||||
return
|
||||
except Exception as exc:
|
||||
last_exc = exc
|
||||
if attempt < _max_retries - 1:
|
||||
delay = 2**attempt # 1s, 2s
|
||||
logger.warning(
|
||||
"[Slack] send failed (attempt %d/%d), retrying in %ds: %s",
|
||||
attempt + 1,
|
||||
_max_retries,
|
||||
delay,
|
||||
exc,
|
||||
)
|
||||
await asyncio.sleep(delay)
|
||||
|
||||
logger.error("[Slack] send failed after %d attempts: %s", _max_retries, last_exc)
|
||||
# Add failure reaction on error
|
||||
if msg.thread_ts:
|
||||
try:
|
||||
await asyncio.to_thread(
|
||||
self._add_reaction,
|
||||
msg.chat_id,
|
||||
msg.thread_ts,
|
||||
"x",
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
raise last_exc # type: ignore[misc]
|
||||
|
||||
# -- internal ----------------------------------------------------------
|
||||
|
||||
def _add_reaction(self, channel_id: str, timestamp: str, emoji: str) -> None:
|
||||
"""Add an emoji reaction to a message (best-effort, non-blocking)."""
|
||||
if not self._web_client:
|
||||
return
|
||||
try:
|
||||
self._web_client.reactions_add(
|
||||
channel=channel_id,
|
||||
timestamp=timestamp,
|
||||
name=emoji,
|
||||
)
|
||||
except Exception as exc:
|
||||
if "already_reacted" not in str(exc):
|
||||
logger.warning("[Slack] failed to add reaction %s: %s", emoji, exc)
|
||||
|
||||
def _send_running_reply(self, channel_id: str, thread_ts: str) -> None:
|
||||
"""Send a 'Working on it......' reply in the thread (called from SDK thread)."""
|
||||
if not self._web_client:
|
||||
return
|
||||
try:
|
||||
self._web_client.chat_postMessage(
|
||||
channel=channel_id,
|
||||
text=":hourglass_flowing_sand: Working on it...",
|
||||
thread_ts=thread_ts,
|
||||
)
|
||||
logger.info("[Slack] 'Working on it...' reply sent in channel=%s, thread_ts=%s", channel_id, thread_ts)
|
||||
except Exception:
|
||||
logger.exception("[Slack] failed to send running reply in channel=%s", channel_id)
|
||||
|
||||
def _on_socket_event(self, client, req) -> None:
|
||||
"""Called by slack-sdk for each Socket Mode event."""
|
||||
try:
|
||||
# Acknowledge the event
|
||||
response = self._SocketModeResponse(envelope_id=req.envelope_id)
|
||||
client.send_socket_mode_response(response)
|
||||
|
||||
event_type = req.type
|
||||
if event_type != "events_api":
|
||||
return
|
||||
|
||||
event = req.payload.get("event", {})
|
||||
etype = event.get("type", "")
|
||||
|
||||
# Handle message events (DM or @mention)
|
||||
if etype in ("message", "app_mention"):
|
||||
self._handle_message_event(event)
|
||||
|
||||
except Exception:
|
||||
logger.exception("Error processing Slack event")
|
||||
|
||||
def _handle_message_event(self, event: dict) -> None:
|
||||
# Ignore bot messages
|
||||
if event.get("bot_id") or event.get("subtype"):
|
||||
return
|
||||
|
||||
user_id = event.get("user", "")
|
||||
|
||||
# Check allowed users
|
||||
if self._allowed_users and user_id not in self._allowed_users:
|
||||
logger.debug("Ignoring message from non-allowed user: %s", user_id)
|
||||
return
|
||||
|
||||
text = event.get("text", "").strip()
|
||||
if not text:
|
||||
return
|
||||
|
||||
channel_id = event.get("channel", "")
|
||||
thread_ts = event.get("thread_ts") or event.get("ts", "")
|
||||
|
||||
if text.startswith("/"):
|
||||
msg_type = InboundMessageType.COMMAND
|
||||
else:
|
||||
msg_type = InboundMessageType.CHAT
|
||||
|
||||
# topic_id: use thread_ts as the topic identifier.
|
||||
# For threaded messages, thread_ts is the root message ts (shared topic).
|
||||
# For non-threaded messages, thread_ts is the message's own ts (new topic).
|
||||
inbound = self._make_inbound(
|
||||
chat_id=channel_id,
|
||||
user_id=user_id,
|
||||
text=text,
|
||||
msg_type=msg_type,
|
||||
thread_ts=thread_ts,
|
||||
)
|
||||
inbound.topic_id = thread_ts
|
||||
|
||||
if self._loop and self._loop.is_running():
|
||||
# Acknowledge with an eyes reaction
|
||||
self._add_reaction(channel_id, event.get("ts", thread_ts), "eyes")
|
||||
# Send "running" reply first (fire-and-forget from SDK thread)
|
||||
self._send_running_reply(channel_id, thread_ts)
|
||||
asyncio.run_coroutine_threadsafe(self.bus.publish_inbound(inbound), self._loop)
|
||||
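Before posting, `send()` converts the agent's Markdown reply into Slack's mrkdwn dialect. A quick standalone sketch of that conversion, using only the `SlackMarkdownConverter()` constructor and `convert()` call that appear above:

```python
from markdown_to_mrkdwn import SlackMarkdownConverter

converter = SlackMarkdownConverter()
# mrkdwn uses its own syntax for bold text and links, so the converted
# string is what actually gets posted via chat_postMessage in send().
print(converter.convert("**Done.** See [the report](https://example.com/report)"))
```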
153
backend/src/channels/store.py
Normal file
@@ -0,0 +1,153 @@
|
||||
"""ChannelStore — persists IM chat-to-DeerFlow thread mappings."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import tempfile
|
||||
import threading
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChannelStore:
|
||||
"""JSON-file-backed store that maps IM conversations to DeerFlow threads.
|
||||
|
||||
Data layout (on disk)::
|
||||
|
||||
{
|
||||
"<channel_name>:<chat_id>": {
|
||||
"thread_id": "<uuid>",
|
||||
"user_id": "<platform_user>",
|
||||
"created_at": 1700000000.0,
|
||||
"updated_at": 1700000000.0
|
||||
},
|
||||
...
|
||||
}
|
||||
|
||||
The store is intentionally simple — a single JSON file that is atomically
|
||||
rewritten on every mutation. For production workloads with high concurrency,
|
||||
this can be swapped for a proper database backend.
|
||||
"""
|
||||
|
||||
def __init__(self, path: str | Path | None = None) -> None:
|
||||
if path is None:
|
||||
from src.config.paths import get_paths
|
||||
|
||||
path = Path(get_paths().base_dir) / "channels" / "store.json"
|
||||
self._path = Path(path)
|
||||
self._path.parent.mkdir(parents=True, exist_ok=True)
|
||||
self._data: dict[str, dict[str, Any]] = self._load()
|
||||
self._lock = threading.Lock()
|
||||
|
||||
# -- persistence -------------------------------------------------------
|
||||
|
||||
def _load(self) -> dict[str, dict[str, Any]]:
|
||||
if self._path.exists():
|
||||
try:
|
||||
return json.loads(self._path.read_text(encoding="utf-8"))
|
||||
except (json.JSONDecodeError, OSError):
|
||||
logger.warning("Corrupt channel store at %s, starting fresh", self._path)
|
||||
return {}
|
||||
|
||||
def _save(self) -> None:
|
||||
fd = tempfile.NamedTemporaryFile(
|
||||
mode="w",
|
||||
dir=self._path.parent,
|
||||
suffix=".tmp",
|
||||
delete=False,
|
||||
)
|
||||
try:
|
||||
json.dump(self._data, fd, indent=2)
|
||||
fd.close()
|
||||
Path(fd.name).replace(self._path)
|
||||
except BaseException:
|
||||
fd.close()
|
||||
Path(fd.name).unlink(missing_ok=True)
|
||||
raise
|
||||
|
||||
# -- key helpers -------------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def _key(channel_name: str, chat_id: str, topic_id: str | None = None) -> str:
|
||||
if topic_id:
|
||||
return f"{channel_name}:{chat_id}:{topic_id}"
|
||||
return f"{channel_name}:{chat_id}"
|
||||
|
||||
# -- public API --------------------------------------------------------
|
||||
|
||||
def get_thread_id(self, channel_name: str, chat_id: str, topic_id: str | None = None) -> str | None:
|
||||
"""Look up the DeerFlow thread_id for a given IM conversation/topic."""
|
||||
entry = self._data.get(self._key(channel_name, chat_id, topic_id))
|
||||
return entry["thread_id"] if entry else None
|
||||
|
||||
def set_thread_id(
|
||||
self,
|
||||
channel_name: str,
|
||||
chat_id: str,
|
||||
thread_id: str,
|
||||
*,
|
||||
topic_id: str | None = None,
|
||||
user_id: str = "",
|
||||
) -> None:
|
||||
"""Create or update the mapping for an IM conversation/topic."""
|
||||
with self._lock:
|
||||
key = self._key(channel_name, chat_id, topic_id)
|
||||
now = time.time()
|
||||
existing = self._data.get(key)
|
||||
self._data[key] = {
|
||||
"thread_id": thread_id,
|
||||
"user_id": user_id,
|
||||
"created_at": existing["created_at"] if existing else now,
|
||||
"updated_at": now,
|
||||
}
|
||||
self._save()
|
||||
|
||||
def remove(self, channel_name: str, chat_id: str, topic_id: str | None = None) -> bool:
|
||||
"""Remove a mapping.
|
||||
|
||||
If ``topic_id`` is provided, only that specific conversation/topic mapping is removed.
|
||||
If ``topic_id`` is omitted, all mappings whose key starts with
|
||||
``"<channel_name>:<chat_id>"`` (including topic-specific ones) are removed.
|
||||
|
||||
Returns True if at least one mapping was removed.
|
||||
"""
|
||||
with self._lock:
|
||||
# Remove a specific conversation/topic mapping.
|
||||
if topic_id is not None:
|
||||
key = self._key(channel_name, chat_id, topic_id)
|
||||
if key in self._data:
|
||||
del self._data[key]
|
||||
self._save()
|
||||
return True
|
||||
return False
|
||||
|
||||
# Remove all mappings for this channel/chat_id (base and any topic-specific keys).
|
||||
prefix = self._key(channel_name, chat_id)
|
||||
keys_to_delete = [k for k in self._data if k == prefix or k.startswith(prefix + ":")]
|
||||
if not keys_to_delete:
|
||||
return False
|
||||
|
||||
for k in keys_to_delete:
|
||||
del self._data[k]
|
||||
self._save()
|
||||
return True
|
||||
|
||||
def list_entries(self, channel_name: str | None = None) -> list[dict[str, Any]]:
|
||||
"""List all stored mappings, optionally filtered by channel."""
|
||||
results = []
|
||||
for key, entry in self._data.items():
|
||||
parts = key.split(":", 2)
|
||||
ch = parts[0]
|
||||
chat = parts[1] if len(parts) > 1 else ""
|
||||
topic = parts[2] if len(parts) > 2 else None
|
||||
if channel_name and ch != channel_name:
|
||||
continue
|
||||
item: dict[str, Any] = {"channel_name": ch, "chat_id": chat, **entry}
|
||||
if topic is not None:
|
||||
item["topic_id"] = topic
|
||||
results.append(item)
|
||||
return results
|
||||
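The mapping API above is small enough to exercise directly. A minimal sketch, assuming the module is importable as `src.channels.store` (matching the other imports in this commit) and using a throwaway path so the real `store.json` is untouched:

```python
import tempfile
from pathlib import Path

from src.channels.store import ChannelStore

store = ChannelStore(path=Path(tempfile.mkdtemp()) / "store.json")

# Map a Slack thread (topic) to a DeerFlow thread.
store.set_thread_id("slack", "C0123", "df-thread-1", topic_id="1700000000.000100", user_id="U42")
assert store.get_thread_id("slack", "C0123", topic_id="1700000000.000100") == "df-thread-1"

# A lookup without the topic_id uses a different key, so nothing is found.
assert store.get_thread_id("slack", "C0123") is None

# remove() without topic_id clears the base key and every topic-specific key for the chat.
store.remove("slack", "C0123")
assert store.list_entries("slack") == []
```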
225
backend/src/channels/telegram.py
Normal file
@@ -0,0 +1,225 @@
|
||||
"""Telegram channel — connects via long-polling (no public IP needed)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
import threading
|
||||
from typing import Any
|
||||
|
||||
from src.channels.base import Channel
|
||||
from src.channels.message_bus import InboundMessageType, MessageBus, OutboundMessage
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class TelegramChannel(Channel):
|
||||
"""Telegram bot channel using long-polling.
|
||||
|
||||
Configuration keys (in ``config.yaml`` under ``channels.telegram``):
|
||||
- ``bot_token``: Telegram Bot API token (from @BotFather).
|
||||
- ``allowed_users``: (optional) List of allowed Telegram user IDs. Empty = allow all.
|
||||
"""
|
||||
|
||||
def __init__(self, bus: MessageBus, config: dict[str, Any]) -> None:
|
||||
super().__init__(name="telegram", bus=bus, config=config)
|
||||
self._application = None
|
||||
self._thread: threading.Thread | None = None
|
||||
self._tg_loop: asyncio.AbstractEventLoop | None = None
|
||||
self._main_loop: asyncio.AbstractEventLoop | None = None
|
||||
self._allowed_users: set[int] = set()
|
||||
for uid in config.get("allowed_users", []):
|
||||
try:
|
||||
self._allowed_users.add(int(uid))
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
# chat_id -> last sent message_id for threaded replies
|
||||
self._last_bot_message: dict[str, int] = {}
|
||||
|
||||
async def start(self) -> None:
|
||||
if self._running:
|
||||
return
|
||||
|
||||
try:
|
||||
from telegram.ext import ApplicationBuilder, CommandHandler, MessageHandler, filters
|
||||
except ImportError:
|
||||
logger.error("python-telegram-bot is not installed. Install it with: uv add python-telegram-bot")
|
||||
return
|
||||
|
||||
bot_token = self.config.get("bot_token", "")
|
||||
if not bot_token:
|
||||
logger.error("Telegram channel requires bot_token")
|
||||
return
|
||||
|
||||
self._main_loop = asyncio.get_event_loop()
|
||||
self._running = True
|
||||
self.bus.subscribe_outbound(self._on_outbound)
|
||||
|
||||
# Build the application
|
||||
app = ApplicationBuilder().token(bot_token).build()
|
||||
|
||||
# Command handlers
|
||||
app.add_handler(CommandHandler("start", self._cmd_start))
|
||||
app.add_handler(CommandHandler("new", self._cmd_generic))
|
||||
app.add_handler(CommandHandler("status", self._cmd_generic))
|
||||
app.add_handler(CommandHandler("models", self._cmd_generic))
|
||||
app.add_handler(CommandHandler("memory", self._cmd_generic))
|
||||
app.add_handler(CommandHandler("help", self._cmd_generic))
|
||||
|
||||
# General message handler
|
||||
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, self._on_text))
|
||||
|
||||
self._application = app
|
||||
|
||||
# Run polling in a dedicated thread with its own event loop
|
||||
self._thread = threading.Thread(target=self._run_polling, daemon=True)
|
||||
self._thread.start()
|
||||
logger.info("Telegram channel started")
|
||||
|
||||
async def stop(self) -> None:
|
||||
self._running = False
|
||||
self.bus.unsubscribe_outbound(self._on_outbound)
|
||||
if self._application and self._tg_loop:
|
||||
self._tg_loop.call_soon_threadsafe(self._tg_loop.stop)
|
||||
if self._thread:
|
||||
self._thread.join(timeout=5)
|
||||
self._thread = None
|
||||
self._application = None
|
||||
logger.info("Telegram channel stopped")
|
||||
|
||||
async def send(self, msg: OutboundMessage, *, _max_retries: int = 3) -> None:
|
||||
if not self._application:
|
||||
return
|
||||
|
||||
try:
|
||||
chat_id = int(msg.chat_id)
|
||||
except (ValueError, TypeError):
|
||||
logger.error("Invalid Telegram chat_id: %s", msg.chat_id)
|
||||
return
|
||||
|
||||
kwargs: dict[str, Any] = {"chat_id": chat_id, "text": msg.text}
|
||||
|
||||
# Reply to the last bot message in this chat for threading
|
||||
reply_to = self._last_bot_message.get(msg.chat_id)
|
||||
if reply_to:
|
||||
kwargs["reply_to_message_id"] = reply_to
|
||||
|
||||
bot = self._application.bot
|
||||
last_exc: Exception | None = None
|
||||
for attempt in range(_max_retries):
|
||||
try:
|
||||
sent = await bot.send_message(**kwargs)
|
||||
self._last_bot_message[msg.chat_id] = sent.message_id
|
||||
return
|
||||
except Exception as exc:
|
||||
last_exc = exc
|
||||
if attempt < _max_retries - 1:
|
||||
delay = 2**attempt # 1s, 2s
|
||||
logger.warning(
|
||||
"[Telegram] send failed (attempt %d/%d), retrying in %ds: %s",
|
||||
attempt + 1,
|
||||
_max_retries,
|
||||
delay,
|
||||
exc,
|
||||
)
|
||||
await asyncio.sleep(delay)
|
||||
|
||||
logger.error("[Telegram] send failed after %d attempts: %s", _max_retries, last_exc)
|
||||
raise last_exc # type: ignore[misc]
|
||||
|
||||
# -- helpers -----------------------------------------------------------
|
||||
|
||||
async def _send_running_reply(self, chat_id: str, reply_to_message_id: int) -> None:
|
||||
"""Send a 'Working on it...' reply to the user's message."""
|
||||
if not self._application:
|
||||
return
|
||||
try:
|
||||
bot = self._application.bot
|
||||
await bot.send_message(
|
||||
chat_id=int(chat_id),
|
||||
text="Working on it...",
|
||||
reply_to_message_id=reply_to_message_id,
|
||||
)
|
||||
logger.info("[Telegram] 'Working on it...' reply sent in chat=%s", chat_id)
|
||||
except Exception:
|
||||
logger.exception("[Telegram] failed to send running reply in chat=%s", chat_id)
|
||||
|
||||
# -- internal ----------------------------------------------------------
|
||||
|
||||
def _run_polling(self) -> None:
|
||||
"""Run telegram polling in a dedicated thread."""
|
||||
self._tg_loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(self._tg_loop)
|
||||
try:
|
||||
self._tg_loop.run_until_complete(self._application.run_polling(close_loop=False))
|
||||
except Exception:
|
||||
if self._running:
|
||||
logger.exception("Telegram polling error")
|
||||
|
||||
def _check_user(self, user_id: int) -> bool:
|
||||
if not self._allowed_users:
|
||||
return True
|
||||
return user_id in self._allowed_users
|
||||
|
||||
async def _cmd_start(self, update, context) -> None:
|
||||
"""Handle /start command."""
|
||||
if not self._check_user(update.effective_user.id):
|
||||
return
|
||||
await update.message.reply_text("Welcome to DeerFlow! Send me a message to start a conversation.\nType /help for available commands.")
|
||||
|
||||
async def _cmd_generic(self, update, context) -> None:
|
||||
"""Forward slash commands to the channel manager."""
|
||||
if not self._check_user(update.effective_user.id):
|
||||
return
|
||||
|
||||
text = update.message.text
|
||||
chat_id = str(update.effective_chat.id)
|
||||
user_id = str(update.effective_user.id)
|
||||
msg_id = str(update.message.message_id)
|
||||
|
||||
inbound = self._make_inbound(
|
||||
chat_id=chat_id,
|
||||
user_id=user_id,
|
||||
text=text,
|
||||
msg_type=InboundMessageType.COMMAND,
|
||||
thread_ts=msg_id,
|
||||
)
|
||||
|
||||
if self._main_loop and self._main_loop.is_running():
|
||||
asyncio.run_coroutine_threadsafe(self._send_running_reply(chat_id, update.message.message_id), self._main_loop)
|
||||
asyncio.run_coroutine_threadsafe(self.bus.publish_inbound(inbound), self._main_loop)
|
||||
|
||||
async def _on_text(self, update, context) -> None:
|
||||
"""Handle regular text messages."""
|
||||
if not self._check_user(update.effective_user.id):
|
||||
return
|
||||
|
||||
text = update.message.text.strip()
|
||||
if not text:
|
||||
return
|
||||
|
||||
chat_id = str(update.effective_chat.id)
|
||||
user_id = str(update.effective_user.id)
|
||||
msg_id = str(update.message.message_id)
|
||||
|
||||
# topic_id: if the user is replying to a bot message, look up
|
||||
# the original topic_id stored for that reply chain. Otherwise
|
||||
# the current message starts a new topic.
|
||||
reply_to = update.message.reply_to_message
|
||||
if reply_to:
|
||||
topic_id = str(reply_to.message_id)
|
||||
else:
|
||||
topic_id = msg_id
|
||||
|
||||
inbound = self._make_inbound(
|
||||
chat_id=chat_id,
|
||||
user_id=user_id,
|
||||
text=text,
|
||||
msg_type=InboundMessageType.CHAT,
|
||||
thread_ts=msg_id,
|
||||
)
|
||||
inbound.topic_id = topic_id
|
||||
|
||||
if self._main_loop and self._main_loop.is_running():
|
||||
asyncio.run_coroutine_threadsafe(self._send_running_reply(chat_id, update.message.message_id), self._main_loop)
|
||||
asyncio.run_coroutine_threadsafe(self.bus.publish_inbound(inbound), self._main_loop)
|
||||
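To make the topic bookkeeping in `_on_text` concrete, a tiny illustrative helper (the function name is mine, not part of the module) that mirrors how the topic_id is chosen:

```python
def pick_topic_id(message_id: str, reply_to_message_id: str | None) -> str:
    """Replying to an earlier message keeps that message's topic; anything else starts a new one."""
    return reply_to_message_id if reply_to_message_id else message_id


assert pick_topic_id("105", None) == "105"   # fresh message: its own id becomes the topic
assert pick_topic_id("106", "101") == "101"  # reply: reuse the replied-to message's id
```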
@@ -82,8 +82,7 @@ class InfoQuestClient:
|
||||
return response_data["reader_result"]
|
||||
elif "content" in response_data:
|
||||
# Fallback to content field if reader_result is not available
|
||||
logger.debug("reader_result missing in JSON response, falling back to content field: %s",
|
||||
response_data["content"])
|
||||
logger.debug("reader_result missing in JSON response, falling back to content field: %s", response_data["content"])
|
||||
return response_data["content"]
|
||||
else:
|
||||
# If neither field exists, return the original response
|
||||
|
||||
@@ -10,6 +10,7 @@ from src.gateway.config import get_gateway_config
|
||||
from src.gateway.routers import (
|
||||
agents,
|
||||
artifacts,
|
||||
channels,
|
||||
mcp,
|
||||
memory,
|
||||
models,
|
||||
@@ -47,7 +48,24 @@ async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
|
||||
# 2. Gateway and LangGraph Server are separate processes with independent caches
|
||||
# MCP tools are lazily initialized in LangGraph Server when first needed
|
||||
|
||||
# Start IM channel service if any channels are configured
|
||||
try:
|
||||
from src.channels.service import start_channel_service
|
||||
|
||||
channel_service = await start_channel_service()
|
||||
logger.info("Channel service started: %s", channel_service.get_status())
|
||||
except Exception:
|
||||
logger.exception("No IM channels configured or channel service failed to start")
|
||||
|
||||
yield
|
||||
|
||||
# Stop channel service on shutdown
|
||||
try:
|
||||
from src.channels.service import stop_channel_service
|
||||
|
||||
await stop_channel_service()
|
||||
except Exception:
|
||||
logger.exception("Failed to stop channel service")
|
||||
logger.info("Shutting down API Gateway")
|
||||
|
||||
|
||||
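The same start/stop helpers can be driven outside the gateway lifespan, for example from a one-off script. A hedged sketch, assuming `config.yaml` has at least one channel enabled and the imports resolve as in the hunk above:

```python
import asyncio

from src.channels.service import start_channel_service, stop_channel_service


async def main() -> None:
    # Starts every channel enabled in config.yaml and returns the singleton service.
    service = await start_channel_service()
    print(service.get_status())
    try:
        await asyncio.sleep(3600)  # keep the channels polling/listening for an hour
    finally:
        await stop_channel_service()


asyncio.run(main())
```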
@@ -117,6 +135,10 @@ This gateway provides custom endpoints for models, MCP configuration, skills, an
|
||||
"name": "suggestions",
|
||||
"description": "Generate follow-up question suggestions for conversations",
|
||||
},
|
||||
{
|
||||
"name": "channels",
|
||||
"description": "Manage IM channel integrations (Feishu, Slack, Telegram)",
|
||||
},
|
||||
{
|
||||
"name": "health",
|
||||
"description": "Health check and system status endpoints",
|
||||
@@ -151,6 +173,9 @@ This gateway provides custom endpoints for models, MCP configuration, skills, an
|
||||
# Suggestions API is mounted at /api/threads/{thread_id}/suggestions
|
||||
app.include_router(suggestions.router)
|
||||
|
||||
# Channels API is mounted at /api/channels
|
||||
app.include_router(channels.router)
|
||||
|
||||
@app.get("/health", tags=["health"])
|
||||
async def health_check() -> dict:
|
||||
"""Health check endpoint.
|
||||
|
||||
52
backend/src/gateway/routers/channels.py
Normal file
@@ -0,0 +1,52 @@
"""Gateway router for IM channel management."""

from __future__ import annotations

import logging

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

logger = logging.getLogger(__name__)

router = APIRouter(prefix="/api/channels", tags=["channels"])


class ChannelStatusResponse(BaseModel):
    service_running: bool
    channels: dict[str, dict]


class ChannelRestartResponse(BaseModel):
    success: bool
    message: str


@router.get("/", response_model=ChannelStatusResponse)
async def get_channels_status() -> ChannelStatusResponse:
    """Get the status of all IM channels."""
    from src.channels.service import get_channel_service

    service = get_channel_service()
    if service is None:
        return ChannelStatusResponse(service_running=False, channels={})
    status = service.get_status()
    return ChannelStatusResponse(**status)


@router.post("/{name}/restart", response_model=ChannelRestartResponse)
async def restart_channel(name: str) -> ChannelRestartResponse:
    """Restart a specific IM channel."""
    from src.channels.service import get_channel_service

    service = get_channel_service()
    if service is None:
        raise HTTPException(status_code=503, detail="Channel service is not running")

    success = await service.restart_channel(name)
    if success:
        logger.info("Channel %s restarted successfully", name)
        return ChannelRestartResponse(success=True, message=f"Channel {name} restarted successfully")
    else:
        logger.warning("Failed to restart channel %s", name)
        return ChannelRestartResponse(success=False, message=f"Failed to restart channel {name}")
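For reference, a minimal client-side sketch against these two endpoints (assuming the gateway listens on its default port from the README config and `httpx` is available):

```python
import httpx

BASE = "http://localhost:8001"

# Status of all IM channels (ChannelStatusResponse).
status = httpx.get(f"{BASE}/api/channels/").json()
print(status["service_running"], list(status["channels"]))

# Restart a single channel by name, e.g. "slack" (ChannelRestartResponse).
result = httpx.post(f"{BASE}/api/channels/slack/restart").json()
print(result["success"], result["message"])
```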
@@ -19,10 +19,7 @@ def _build_missing_dependency_hint(module_path: str, err: ImportError) -> str:
|
||||
if package_name is None:
|
||||
package_name = MODULE_TO_PACKAGE_HINTS.get(missing_module, missing_module.replace("_", "-"))
|
||||
|
||||
return (
|
||||
f"Missing dependency '{missing_module}'. "
|
||||
f"Install it with `uv add {package_name}` (or `pip install {package_name}`), then restart DeerFlow."
|
||||
)
|
||||
return f"Missing dependency '{missing_module}'. Install it with `uv add {package_name}` (or `pip install {package_name}`), then restart DeerFlow."
|
||||
|
||||
|
||||
def resolve_variable[T](
|
||||
|
||||
@@ -147,10 +147,7 @@ class LocalSandbox(Sandbox):
|
||||
shell_from_path = shutil.which("sh")
|
||||
if shell_from_path is not None:
|
||||
return shell_from_path
|
||||
raise RuntimeError(
|
||||
"No suitable shell executable found. Tried /bin/zsh, /bin/bash, "
|
||||
"/bin/sh, and `sh` on PATH."
|
||||
)
|
||||
raise RuntimeError("No suitable shell executable found. Tried /bin/zsh, /bin/bash, /bin/sh, and `sh` on PATH.")
|
||||
|
||||
def execute_command(self, command: str) -> str:
|
||||
# Resolve container paths in command before execution
|
||||
|
||||
@@ -54,9 +54,7 @@ def _normalize_presented_filepath(
|
||||
try:
|
||||
relative_path = actual_path.relative_to(outputs_dir)
|
||||
except ValueError as exc:
|
||||
raise ValueError(
|
||||
f"Only files in {OUTPUTS_VIRTUAL_PREFIX} can be presented: {filepath}"
|
||||
) from exc
|
||||
raise ValueError(f"Only files in {OUTPUTS_VIRTUAL_PREFIX} can be presented: {filepath}") from exc
|
||||
|
||||
return f"{OUTPUTS_VIRTUAL_PREFIX}/{relative_path.as_posix()}"
|
||||
|
||||
@@ -87,22 +85,16 @@ def present_file_tool(
|
||||
filepaths: List of absolute file paths to present to the user. **Only** files in `/mnt/user-data/outputs` can be presented.
|
||||
"""
|
||||
try:
|
||||
normalized_paths = [
|
||||
_normalize_presented_filepath(runtime, filepath) for filepath in filepaths
|
||||
]
|
||||
normalized_paths = [_normalize_presented_filepath(runtime, filepath) for filepath in filepaths]
|
||||
except ValueError as exc:
|
||||
return Command(
|
||||
update={
|
||||
"messages": [ToolMessage(f"Error: {exc}", tool_call_id=tool_call_id)]
|
||||
},
|
||||
update={"messages": [ToolMessage(f"Error: {exc}", tool_call_id=tool_call_id)]},
|
||||
)
|
||||
|
||||
# The merge_artifacts reducer will handle merging and deduplication
|
||||
return Command(
|
||||
update={
|
||||
"artifacts": normalized_paths,
|
||||
"messages": [
|
||||
ToolMessage("Successfully presented files", tool_call_id=tool_call_id)
|
||||
],
|
||||
"messages": [ToolMessage("Successfully presented files", tool_call_id=tool_call_id)],
|
||||
},
|
||||
)
|
||||
|
||||
1094
backend/tests/test_channels.py
Normal file
File diff suppressed because it is too large
@@ -20,6 +20,7 @@ from src.gateway.routers.uploads import UploadResponse
|
||||
# Fixtures
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_app_config():
|
||||
"""Provide a minimal AppConfig mock."""
|
||||
@@ -45,6 +46,7 @@ def client(mock_app_config):
|
||||
# __init__
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestClientInit:
|
||||
def test_default_params(self, client):
|
||||
assert client._model_name is None
|
||||
@@ -86,6 +88,7 @@ class TestClientInit:
|
||||
# list_models / list_skills / get_memory
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestConfigQueries:
|
||||
def test_list_models(self, client):
|
||||
result = client.list_models()
|
||||
@@ -135,6 +138,7 @@ class TestConfigQueries:
|
||||
# stream / chat
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _make_agent_mock(chunks: list[dict]):
|
||||
"""Create a mock agent whose .stream() yields the given chunks."""
|
||||
agent = MagicMock()
|
||||
@@ -314,6 +318,7 @@ class TestChat:
|
||||
# _extract_text
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestExtractText:
|
||||
def test_string(self):
|
||||
assert DeerFlowClient._extract_text("hello") == "hello"
|
||||
@@ -340,6 +345,7 @@ class TestExtractText:
|
||||
# _ensure_agent
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestEnsureAgent:
|
||||
def test_creates_agent(self, client):
|
||||
"""_ensure_agent creates an agent on first call."""
|
||||
@@ -374,6 +380,7 @@ class TestEnsureAgent:
|
||||
# get_model
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGetModel:
|
||||
def test_found(self, client):
|
||||
model_cfg = MagicMock()
|
||||
@@ -402,6 +409,7 @@ class TestGetModel:
|
||||
# MCP config
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestMcpConfig:
|
||||
def test_get_mcp_config(self, client):
|
||||
server = MagicMock()
|
||||
@@ -457,6 +465,7 @@ class TestMcpConfig:
|
||||
# Skills management
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestSkillsManagement:
|
||||
def _make_skill(self, name="test-skill", enabled=True):
|
||||
s = MagicMock()
|
||||
@@ -556,6 +565,7 @@ class TestSkillsManagement:
|
||||
# Memory management
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestMemoryManagement:
|
||||
def test_reload_memory(self, client):
|
||||
data = {"version": "1.0", "facts": []}
|
||||
@@ -605,6 +615,7 @@ class TestMemoryManagement:
|
||||
# Uploads
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestUploads:
|
||||
def test_upload_files(self, client):
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
@@ -678,6 +689,7 @@ class TestUploads:
|
||||
# Artifacts
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestArtifacts:
|
||||
def test_get_artifact(self, client):
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
@@ -759,9 +771,13 @@ class TestScenarioMultiTurnConversation:
|
||||
|
||||
def test_stream_collects_all_event_types_across_turns(self, client):
|
||||
"""A full turn emits messages-tuple (tool_call, tool_result, ai text) + values + end."""
|
||||
ai_tc = AIMessage(content="", id="ai-1", tool_calls=[
|
||||
{"name": "web_search", "args": {"query": "LangGraph"}, "id": "tc-1"},
|
||||
])
|
||||
ai_tc = AIMessage(
|
||||
content="",
|
||||
id="ai-1",
|
||||
tool_calls=[
|
||||
{"name": "web_search", "args": {"query": "LangGraph"}, "id": "tc-1"},
|
||||
],
|
||||
)
|
||||
tool_r = ToolMessage(content="LangGraph is a framework...", id="tm-1", tool_call_id="tc-1", name="web_search")
|
||||
ai_final = AIMessage(content="LangGraph is a framework for building agents.", id="ai-2")
|
||||
|
||||
@@ -809,13 +825,21 @@ class TestScenarioToolChain:
|
||||
|
||||
def test_multi_tool_chain(self, client):
|
||||
"""Agent calls bash → reads output → calls write_file → responds."""
|
||||
ai_bash = AIMessage(content="", id="ai-1", tool_calls=[
|
||||
{"name": "bash", "args": {"cmd": "ls /mnt/user-data/workspace"}, "id": "tc-1"},
|
||||
])
|
||||
ai_bash = AIMessage(
|
||||
content="",
|
||||
id="ai-1",
|
||||
tool_calls=[
|
||||
{"name": "bash", "args": {"cmd": "ls /mnt/user-data/workspace"}, "id": "tc-1"},
|
||||
],
|
||||
)
|
||||
bash_result = ToolMessage(content="README.md\nsrc/", id="tm-1", tool_call_id="tc-1", name="bash")
|
||||
ai_write = AIMessage(content="", id="ai-2", tool_calls=[
|
||||
{"name": "write_file", "args": {"path": "/mnt/user-data/outputs/listing.txt", "content": "README.md\nsrc/"}, "id": "tc-2"},
|
||||
])
|
||||
ai_write = AIMessage(
|
||||
content="",
|
||||
id="ai-2",
|
||||
tool_calls=[
|
||||
{"name": "write_file", "args": {"path": "/mnt/user-data/outputs/listing.txt", "content": "README.md\nsrc/"}, "id": "tc-2"},
|
||||
],
|
||||
)
|
||||
write_result = ToolMessage(content="File written successfully.", id="tm-2", tool_call_id="tc-2", name="write_file")
|
||||
ai_final = AIMessage(content="I listed the workspace and saved the output.", id="ai-3")
|
||||
|
||||
@@ -862,10 +886,13 @@ class TestScenarioFileLifecycle:
|
||||
|
||||
with patch.object(DeerFlowClient, "_get_uploads_dir", return_value=uploads_dir):
|
||||
# Step 1: Upload
|
||||
result = client.upload_files("t-lifecycle", [
|
||||
tmp_path / "report.txt",
|
||||
tmp_path / "data.csv",
|
||||
])
|
||||
result = client.upload_files(
|
||||
"t-lifecycle",
|
||||
[
|
||||
tmp_path / "report.txt",
|
||||
tmp_path / "data.csv",
|
||||
],
|
||||
)
|
||||
assert result["success"] is True
|
||||
assert len(result["files"]) == 2
|
||||
assert {f["filename"] for f in result["files"]} == {"report.txt", "data.csv"}
|
||||
@@ -1166,10 +1193,13 @@ class TestScenarioMemoryWorkflow:
|
||||
def test_memory_full_lifecycle(self, client):
|
||||
"""get_memory → reload → get_status covers the full memory API."""
|
||||
initial_data = {"version": "1.0", "facts": [{"id": "f1", "content": "User likes Python"}]}
|
||||
updated_data = {"version": "1.0", "facts": [
|
||||
{"id": "f1", "content": "User likes Python"},
|
||||
{"id": "f2", "content": "User prefers dark mode"},
|
||||
]}
|
||||
updated_data = {
|
||||
"version": "1.0",
|
||||
"facts": [
|
||||
{"id": "f1", "content": "User likes Python"},
|
||||
{"id": "f2", "content": "User prefers dark mode"},
|
||||
],
|
||||
}
|
||||
|
||||
config = MagicMock()
|
||||
config.enabled = True
|
||||
@@ -1208,9 +1238,7 @@ class TestScenarioSkillInstallAndUse:
|
||||
# Create .skill archive
|
||||
skill_src = tmp_path / "my-analyzer"
|
||||
skill_src.mkdir()
|
||||
(skill_src / "SKILL.md").write_text(
|
||||
"---\nname: my-analyzer\ndescription: Analyze code\nlicense: MIT\n---\nAnalysis skill"
|
||||
)
|
||||
(skill_src / "SKILL.md").write_text("---\nname: my-analyzer\ndescription: Analyze code\nlicense: MIT\n---\nAnalysis skill")
|
||||
archive = tmp_path / "my-analyzer.skill"
|
||||
with zipfile.ZipFile(archive, "w") as zf:
|
||||
zf.write(skill_src / "SKILL.md", "my-analyzer/SKILL.md")
|
||||
@@ -1319,11 +1347,15 @@ class TestScenarioEdgeCases:
|
||||
|
||||
def test_concurrent_tool_calls_in_single_message(self, client):
|
||||
"""Agent produces multiple tool_calls in one AIMessage — emitted as single messages-tuple."""
|
||||
ai = AIMessage(content="", id="ai-1", tool_calls=[
|
||||
{"name": "web_search", "args": {"q": "a"}, "id": "tc-1"},
|
||||
{"name": "web_search", "args": {"q": "b"}, "id": "tc-2"},
|
||||
{"name": "bash", "args": {"cmd": "echo hi"}, "id": "tc-3"},
|
||||
])
|
||||
ai = AIMessage(
|
||||
content="",
|
||||
id="ai-1",
|
||||
tool_calls=[
|
||||
{"name": "web_search", "args": {"q": "a"}, "id": "tc-1"},
|
||||
{"name": "web_search", "args": {"q": "b"}, "id": "tc-2"},
|
||||
{"name": "bash", "args": {"cmd": "echo hi"}, "id": "tc-3"},
|
||||
],
|
||||
)
|
||||
chunks = [{"messages": [ai]}]
|
||||
agent = _make_agent_mock(chunks)
|
||||
|
||||
@@ -1367,6 +1399,7 @@ class TestScenarioEdgeCases:
|
||||
# Gateway conformance — validate client output against Gateway Pydantic models
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGatewayConformance:
|
||||
"""Validate that DeerFlowClient return dicts conform to Gateway Pydantic response models.
|
||||
|
||||
@@ -1441,9 +1474,7 @@ class TestGatewayConformance:
|
||||
def test_install_skill(self, client, tmp_path):
|
||||
skill_dir = tmp_path / "my-skill"
|
||||
skill_dir.mkdir()
|
||||
(skill_dir / "SKILL.md").write_text(
|
||||
"---\nname: my-skill\ndescription: A test skill\n---\nBody\n"
|
||||
)
|
||||
(skill_dir / "SKILL.md").write_text("---\nname: my-skill\ndescription: A test skill\n---\nBody\n")
|
||||
|
||||
archive = tmp_path / "my-skill.skill"
|
||||
with zipfile.ZipFile(archive, "w") as zf:
|
||||
|
||||
@@ -125,7 +125,7 @@ class TestInfoQuestClient:
|
||||
|
||||
def test_clean_results_with_image_search(self):
|
||||
"""Test clean_results_with_image_search method with sample raw results."""
|
||||
raw_results = [{"content": {"results": {"images_results": [{"image_url": "https://example.com/image1.jpg", "thumbnail_url": "https://example.com/thumb1.jpg","url": "https://example.com/page1"}]}}}]
|
||||
raw_results = [{"content": {"results": {"images_results": [{"image_url": "https://example.com/image1.jpg", "thumbnail_url": "https://example.com/thumb1.jpg", "url": "https://example.com/page1"}]}}}]
|
||||
cleaned = InfoQuestClient.clean_results_with_image_search(raw_results)
|
||||
|
||||
assert len(cleaned) == 1
|
||||
@@ -181,4 +181,4 @@ class TestInfoQuestClient:
|
||||
client = InfoQuestClient()
|
||||
result = client.web_search("test query")
|
||||
|
||||
assert "Error" in result
|
||||
assert "Error" in result
|
||||
|
||||
@@ -16,14 +16,7 @@ from src.agents.middlewares.memory_middleware import _filter_messages_for_memory
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
_UPLOAD_BLOCK = (
|
||||
"<uploaded_files>\n"
|
||||
"The following files have been uploaded and are available for use:\n\n"
|
||||
"- filename: secret.txt\n"
|
||||
" path: /mnt/user-data/uploads/abc123/secret.txt\n"
|
||||
" size: 42 bytes\n"
|
||||
"</uploaded_files>"
|
||||
)
|
||||
_UPLOAD_BLOCK = "<uploaded_files>\nThe following files have been uploaded and are available for use:\n\n- filename: secret.txt\n path: /mnt/user-data/uploads/abc123/secret.txt\n size: 42 bytes\n</uploaded_files>"
|
||||
|
||||
|
||||
def _human(text: str) -> HumanMessage:
|
||||
@@ -103,7 +96,7 @@ class TestFilterMessagesForMemory:
|
||||
msgs = [
|
||||
_human("Hello, how are you?"),
|
||||
_ai("I'm doing well, thank you!"),
|
||||
_human(_UPLOAD_BLOCK), # upload-only → dropped
|
||||
_human(_UPLOAD_BLOCK), # upload-only → dropped
|
||||
_ai("I read the uploaded file."), # paired AI → dropped
|
||||
_human("What is 2 + 2?"),
|
||||
_ai("4"),
|
||||
@@ -122,9 +115,11 @@ class TestFilterMessagesForMemory:
|
||||
|
||||
def test_multimodal_content_list_handled(self):
|
||||
"""Human messages with list-style content (multimodal) are handled."""
|
||||
msg = HumanMessage(content=[
|
||||
{"type": "text", "text": _UPLOAD_BLOCK},
|
||||
])
|
||||
msg = HumanMessage(
|
||||
content=[
|
||||
{"type": "text", "text": _UPLOAD_BLOCK},
|
||||
]
|
||||
)
|
||||
msgs = [msg, _ai("Done.")]
|
||||
result = _filter_messages_for_memory(msgs)
|
||||
assert result == []
|
||||
@@ -134,9 +129,7 @@ class TestFilterMessagesForMemory:
|
||||
combined = _UPLOAD_BLOCK + "\n\nSummarise the file please."
|
||||
msgs = [_human(combined), _ai("It says hello.")]
|
||||
result = _filter_messages_for_memory(msgs)
|
||||
all_content = " ".join(
|
||||
m.content for m in result if isinstance(m.content, str)
|
||||
)
|
||||
all_content = " ".join(m.content for m in result if isinstance(m.content, str))
|
||||
assert "/mnt/user-data/uploads/" not in all_content
|
||||
assert "<uploaded_files>" not in all_content
|
||||
|
||||
@@ -157,11 +150,7 @@ class TestStripUploadMentionsFromMemory:
|
||||
# --- summaries ---
|
||||
|
||||
def test_upload_event_sentence_removed_from_summary(self):
|
||||
mem = self._make_memory(
|
||||
"User is interested in AI. "
|
||||
"User uploaded a test file for verification purposes. "
|
||||
"User prefers concise answers."
|
||||
)
|
||||
mem = self._make_memory("User is interested in AI. User uploaded a test file for verification purposes. User prefers concise answers.")
|
||||
result = _strip_upload_mentions_from_memory(mem)
|
||||
summary = result["user"]["topOfMind"]["summary"]
|
||||
assert "uploaded a test file" not in summary
|
||||
@@ -169,11 +158,7 @@ class TestStripUploadMentionsFromMemory:
|
||||
assert "User prefers concise answers" in summary
|
||||
|
||||
def test_upload_path_sentence_removed_from_summary(self):
|
||||
mem = self._make_memory(
|
||||
"User uses Python. "
|
||||
"User uploaded file to /mnt/user-data/uploads/tid/data.csv. "
|
||||
"User likes clean code."
|
||||
)
|
||||
mem = self._make_memory("User uses Python. User uploaded file to /mnt/user-data/uploads/tid/data.csv. User likes clean code.")
|
||||
result = _strip_upload_mentions_from_memory(mem)
|
||||
summary = result["user"]["topOfMind"]["summary"]
|
||||
assert "/mnt/user-data/uploads/" not in summary
|
||||
@@ -193,10 +178,7 @@ class TestStripUploadMentionsFromMemory:
|
||||
|
||||
def test_uploading_a_test_file_removed(self):
|
||||
"""'uploading a test file' (with intervening words) must be caught."""
|
||||
mem = self._make_memory(
|
||||
"User conducted a hands-on test by uploading a test file titled "
|
||||
"'test_deerflow_memory_bug.txt'. User is also learning Python."
|
||||
)
|
||||
mem = self._make_memory("User conducted a hands-on test by uploading a test file titled 'test_deerflow_memory_bug.txt'. User is also learning Python.")
|
||||
result = _strip_upload_mentions_from_memory(mem)
|
||||
summary = result["user"]["topOfMind"]["summary"]
|
||||
assert "test_deerflow_memory_bug.txt" not in summary
|
||||
|
||||
@@ -3,9 +3,7 @@
|
||||
import importlib
|
||||
from types import SimpleNamespace
|
||||
|
||||
present_file_tool_module = importlib.import_module(
|
||||
"src.tools.builtins.present_file_tool"
|
||||
)
|
||||
present_file_tool_module = importlib.import_module("src.tools.builtins.present_file_tool")
|
||||
|
||||
|
||||
def _make_runtime(outputs_path: str) -> SimpleNamespace:
|
||||
@@ -40,9 +38,7 @@ def test_present_files_keeps_virtual_outputs_path(tmp_path, monkeypatch):
|
||||
monkeypatch.setattr(
|
||||
present_file_tool_module,
|
||||
"get_paths",
|
||||
lambda: SimpleNamespace(
|
||||
resolve_virtual_path=lambda thread_id, path: artifact_path
|
||||
),
|
||||
lambda: SimpleNamespace(resolve_virtual_path=lambda thread_id, path: artifact_path),
|
||||
)
|
||||
|
||||
result = present_file_tool_module.present_file_tool.func(
|
||||
@@ -69,7 +65,4 @@ def test_present_files_rejects_paths_outside_outputs(tmp_path):
|
||||
)
|
||||
|
||||
assert "artifacts" not in result.update
|
||||
assert (
|
||||
result.update["messages"][0].content
|
||||
== f"Error: Only files in /mnt/user-data/outputs can be presented: {leaked_path}"
|
||||
)
|
||||
assert result.update["messages"][0].content == f"Error: Only files in /mnt/user-data/outputs can be presented: {leaked_path}"
|
||||
|
||||
@@ -8,6 +8,7 @@ from src.reflection.resolvers import resolve_variable
|
||||
|
||||
def test_resolve_variable_reports_install_hint_for_missing_google_provider(monkeypatch: pytest.MonkeyPatch):
|
||||
"""Missing google provider should return actionable install guidance."""
|
||||
|
||||
def fake_import_module(module_path: str):
|
||||
raise ModuleNotFoundError(f"No module named '{module_path}'", name=module_path)
|
||||
|
||||
@@ -38,6 +39,8 @@ def test_resolve_variable_reports_install_hint_for_missing_google_transitive_dep
|
||||
message = str(exc_info.value)
|
||||
# Even when a transitive dependency is missing, the hint should still point to the provider package.
|
||||
assert "uv add langchain-google-genai" in message
|
||||
|
||||
|
||||
def test_resolve_variable_invalid_path_format():
|
||||
"""Invalid variable path should fail with format guidance."""
|
||||
with pytest.raises(ImportError) as exc_info:
|
||||
|
||||
@@ -5,22 +5,22 @@ from src.gateway.routers import suggestions
|
||||
|
||||
|
||||
def test_strip_markdown_code_fence_removes_wrapping():
|
||||
text = "```json\n[\"a\"]\n```"
|
||||
assert suggestions._strip_markdown_code_fence(text) == "[\"a\"]"
|
||||
text = '```json\n["a"]\n```'
|
||||
assert suggestions._strip_markdown_code_fence(text) == '["a"]'
|
||||
|
||||
|
||||
def test_strip_markdown_code_fence_no_fence_keeps_content():
|
||||
text = " [\"a\"] "
|
||||
assert suggestions._strip_markdown_code_fence(text) == "[\"a\"]"
|
||||
text = ' ["a"] '
|
||||
assert suggestions._strip_markdown_code_fence(text) == '["a"]'
|
||||
|
||||
|
||||
def test_parse_json_string_list_filters_invalid_items():
|
||||
text = "```json\n[\"a\", \" \", 1, \"b\"]\n```"
|
||||
text = '```json\n["a", " ", 1, "b"]\n```'
|
||||
assert suggestions._parse_json_string_list(text) == ["a", "b"]
|
||||
|
||||
|
||||
def test_parse_json_string_list_rejects_non_list():
|
||||
text = "{\"a\": 1}"
|
||||
text = '{"a": 1}'
|
||||
assert suggestions._parse_json_string_list(text) is None
|
||||
|
||||
|
||||
@@ -43,7 +43,7 @@ def test_generate_suggestions_parses_and_limits(monkeypatch):
|
||||
model_name=None,
|
||||
)
|
||||
fake_model = MagicMock()
|
||||
fake_model.invoke.return_value = MagicMock(content="```json\n[\"Q1\", \"Q2\", \"Q3\", \"Q4\"]\n```")
|
||||
fake_model.invoke.return_value = MagicMock(content='```json\n["Q1", "Q2", "Q3", "Q4"]\n```')
|
||||
monkeypatch.setattr(suggestions, "create_chat_model", lambda **kwargs: fake_model)
|
||||
|
||||
result = asyncio.run(suggestions.generate_suggestions("t1", req))
|
||||
@@ -63,4 +63,4 @@ def test_generate_suggestions_returns_empty_on_model_error(monkeypatch):
|
||||
|
||||
result = asyncio.run(suggestions.generate_suggestions("t1", req))
|
||||
|
||||
assert result.suggestions == []
|
||||
assert result.suggestions == []
|
||||
|
||||
@@ -21,7 +21,6 @@ def test_upload_files_writes_thread_storage_and_skips_local_sandbox_sync(tmp_pat
|
||||
patch.object(uploads, "get_uploads_dir", return_value=thread_uploads_dir),
|
||||
patch.object(uploads, "get_sandbox_provider", return_value=provider),
|
||||
):
|
||||
|
||||
file = UploadFile(filename="notes.txt", file=BytesIO(b"hello uploads"))
|
||||
result = asyncio.run(uploads.upload_files("thread-local", files=[file]))
|
||||
|
||||
@@ -52,7 +51,6 @@ def test_upload_files_syncs_non_local_sandbox_and_marks_markdown_file(tmp_path):
|
||||
patch.object(uploads, "get_sandbox_provider", return_value=provider),
|
||||
patch.object(uploads, "convert_file_to_markdown", AsyncMock(side_effect=fake_convert)),
|
||||
):
|
||||
|
||||
file = UploadFile(filename="report.pdf", file=BytesIO(b"pdf-bytes"))
|
||||
result = asyncio.run(uploads.upload_files("thread-aio", files=[file]))
|
||||
|
||||
|
||||
86
backend/uv.lock
generated
@@ -662,12 +662,17 @@ dependencies = [
|
||||
{ name = "langgraph-checkpoint-sqlite" },
|
||||
{ name = "langgraph-cli" },
|
||||
{ name = "langgraph-runtime-inmem" },
|
||||
{ name = "langgraph-sdk" },
|
||||
{ name = "lark-oapi" },
|
||||
{ name = "markdown-to-mrkdwn" },
|
||||
{ name = "markdownify" },
|
||||
{ name = "markitdown", extra = ["all", "xlsx"] },
|
||||
{ name = "pydantic" },
|
||||
{ name = "python-multipart" },
|
||||
{ name = "python-telegram-bot" },
|
||||
{ name = "pyyaml" },
|
||||
{ name = "readabilipy" },
|
||||
{ name = "slack-sdk" },
|
||||
{ name = "sse-starlette" },
|
||||
{ name = "tavily-python" },
|
||||
{ name = "tiktoken" },
|
||||
@@ -701,12 +706,17 @@ requires-dist = [
|
||||
{ name = "langgraph-checkpoint-sqlite", specifier = ">=3.0.3" },
|
||||
{ name = "langgraph-cli", specifier = ">=0.4.14" },
|
||||
{ name = "langgraph-runtime-inmem", specifier = ">=0.22.1" },
|
||||
{ name = "langgraph-sdk", specifier = ">=0.1.51" },
|
||||
{ name = "lark-oapi", specifier = ">=1.4.0" },
|
||||
{ name = "markdown-to-mrkdwn", specifier = ">=0.3.1" },
|
||||
{ name = "markdownify", specifier = ">=1.2.2" },
|
||||
{ name = "markitdown", extras = ["all", "xlsx"], specifier = ">=0.0.1a2" },
|
||||
{ name = "pydantic", specifier = ">=2.12.5" },
|
||||
{ name = "python-multipart", specifier = ">=0.0.20" },
|
||||
{ name = "python-telegram-bot", specifier = ">=21.0" },
|
||||
{ name = "pyyaml", specifier = ">=6.0.3" },
|
||||
{ name = "readabilipy", specifier = ">=0.3.0" },
|
||||
{ name = "slack-sdk", specifier = ">=3.33.0" },
|
||||
{ name = "sse-starlette", specifier = ">=2.1.0" },
|
||||
{ name = "tavily-python", specifier = ">=0.7.17" },
|
||||
{ name = "tiktoken", specifier = ">=0.8.0" },
|
||||
@@ -1715,6 +1725,21 @@ otel = [
|
||||
{ name = "opentelemetry-sdk" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lark-oapi"
|
||||
version = "1.5.3"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "httpx" },
|
||||
{ name = "pycryptodome" },
|
||||
{ name = "requests" },
|
||||
{ name = "requests-toolbelt" },
|
||||
{ name = "websockets" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/bf/ff/2ece5d735ebfa2af600a53176f2636ae47af2bf934e08effab64f0d1e047/lark_oapi-1.5.3-py3-none-any.whl", hash = "sha256:fda6b32bb38d21b6bdaae94979c600b94c7c521e985adade63a54e4b3e20cc36", size = 6993016, upload-time = "2026-01-27T08:21:49.307Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lxml"
|
||||
version = "6.0.2"
|
||||
@@ -1825,6 +1850,15 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/ca/54/2e39566a131b13f6d8d193f974cb6a34e81bb7cc2fa6f7e03de067b36588/mammoth-1.11.0-py2.py3-none-any.whl", hash = "sha256:c077ab0d450bd7c0c6ecd529a23bf7e0fa8190c929e28998308ff4eada3f063b", size = 54752, upload-time = "2025-09-19T10:35:18.699Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "markdown-to-mrkdwn"
|
||||
version = "0.3.1"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/17/8e/f2c62a88097425b0dba3a8699d13154b4c5888b989ffaf6419c10058b338/markdown_to_mrkdwn-0.3.1.tar.gz", hash = "sha256:25f5c095516f8ad956c88c5dab75493aadfaa02e51e3c84459490058a8ca840b", size = 14191, upload-time = "2026-01-05T14:37:29.276Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/52/92/ce0a08fb9769a13be550a7079c3409300ca6eb14ccc9038f67ac44deeef4/markdown_to_mrkdwn-0.3.1-py3-none-any.whl", hash = "sha256:5a6d08f1eaa08aea66953ef0eba206e0bb244d5c62880c76d1e3a11ee46cd3f0", size = 13592, upload-time = "2026-01-05T14:37:28.21Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "markdownify"
|
||||
version = "1.2.2"
|
||||
@@ -2666,6 +2700,36 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/a0/e3/59cd50310fc9b59512193629e1984c1f95e5c8ae6e5d8c69532ccc65a7fe/pycparser-2.23-py3-none-any.whl", hash = "sha256:e5c6e8d3fbad53479cab09ac03729e0a9faf2bee3db8208a550daf5af81a5934", size = 118140, upload-time = "2025-09-09T13:23:46.651Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "pycryptodome"
|
||||
version = "3.23.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/8e/a6/8452177684d5e906854776276ddd34eca30d1b1e15aa1ee9cefc289a33f5/pycryptodome-3.23.0.tar.gz", hash = "sha256:447700a657182d60338bab09fdb27518f8856aecd80ae4c6bdddb67ff5da44ef", size = 4921276, upload-time = "2025-05-17T17:21:45.242Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/04/5d/bdb09489b63cd34a976cc9e2a8d938114f7a53a74d3dd4f125ffa49dce82/pycryptodome-3.23.0-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:0011f7f00cdb74879142011f95133274741778abba114ceca229adbf8e62c3e4", size = 2495152, upload-time = "2025-05-17T17:20:20.833Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/a7/ce/7840250ed4cc0039c433cd41715536f926d6e86ce84e904068eb3244b6a6/pycryptodome-3.23.0-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:90460fc9e088ce095f9ee8356722d4f10f86e5be06e2354230a9880b9c549aae", size = 1639348, upload-time = "2025-05-17T17:20:23.171Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/ee/f0/991da24c55c1f688d6a3b5a11940567353f74590734ee4a64294834ae472/pycryptodome-3.23.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4764e64b269fc83b00f682c47443c2e6e85b18273712b98aa43bcb77f8570477", size = 2184033, upload-time = "2025-05-17T17:20:25.424Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/54/16/0e11882deddf00f68b68dd4e8e442ddc30641f31afeb2bc25588124ac8de/pycryptodome-3.23.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:eb8f24adb74984aa0e5d07a2368ad95276cf38051fe2dc6605cbcf482e04f2a7", size = 2270142, upload-time = "2025-05-17T17:20:27.808Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/d5/fc/4347fea23a3f95ffb931f383ff28b3f7b1fe868739182cb76718c0da86a1/pycryptodome-3.23.0-cp313-cp313t-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d97618c9c6684a97ef7637ba43bdf6663a2e2e77efe0f863cce97a76af396446", size = 2309384, upload-time = "2025-05-17T17:20:30.765Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/6e/d9/c5261780b69ce66d8cfab25d2797bd6e82ba0241804694cd48be41add5eb/pycryptodome-3.23.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:9a53a4fe5cb075075d515797d6ce2f56772ea7e6a1e5e4b96cf78a14bac3d265", size = 2183237, upload-time = "2025-05-17T17:20:33.736Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/5a/6f/3af2ffedd5cfa08c631f89452c6648c4d779e7772dfc388c77c920ca6bbf/pycryptodome-3.23.0-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:763d1d74f56f031788e5d307029caef067febf890cd1f8bf61183ae142f1a77b", size = 2343898, upload-time = "2025-05-17T17:20:36.086Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/9a/dc/9060d807039ee5de6e2f260f72f3d70ac213993a804f5e67e0a73a56dd2f/pycryptodome-3.23.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:954af0e2bd7cea83ce72243b14e4fb518b18f0c1649b576d114973e2073b273d", size = 2269197, upload-time = "2025-05-17T17:20:38.414Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/f9/34/e6c8ca177cb29dcc4967fef73f5de445912f93bd0343c9c33c8e5bf8cde8/pycryptodome-3.23.0-cp313-cp313t-win32.whl", hash = "sha256:257bb3572c63ad8ba40b89f6fc9d63a2a628e9f9708d31ee26560925ebe0210a", size = 1768600, upload-time = "2025-05-17T17:20:40.688Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/e4/1d/89756b8d7ff623ad0160f4539da571d1f594d21ee6d68be130a6eccb39a4/pycryptodome-3.23.0-cp313-cp313t-win_amd64.whl", hash = "sha256:6501790c5b62a29fcb227bd6b62012181d886a767ce9ed03b303d1f22eb5c625", size = 1799740, upload-time = "2025-05-17T17:20:42.413Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/5d/61/35a64f0feaea9fd07f0d91209e7be91726eb48c0f1bfc6720647194071e4/pycryptodome-3.23.0-cp313-cp313t-win_arm64.whl", hash = "sha256:9a77627a330ab23ca43b48b130e202582e91cc69619947840ea4d2d1be21eb39", size = 1703685, upload-time = "2025-05-17T17:20:44.388Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/db/6c/a1f71542c969912bb0e106f64f60a56cc1f0fabecf9396f45accbe63fa68/pycryptodome-3.23.0-cp37-abi3-macosx_10_9_universal2.whl", hash = "sha256:187058ab80b3281b1de11c2e6842a357a1f71b42cb1e15bce373f3d238135c27", size = 2495627, upload-time = "2025-05-17T17:20:47.139Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/6e/4e/a066527e079fc5002390c8acdd3aca431e6ea0a50ffd7201551175b47323/pycryptodome-3.23.0-cp37-abi3-macosx_10_9_x86_64.whl", hash = "sha256:cfb5cd445280c5b0a4e6187a7ce8de5a07b5f3f897f235caa11f1f435f182843", size = 1640362, upload-time = "2025-05-17T17:20:50.392Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/50/52/adaf4c8c100a8c49d2bd058e5b551f73dfd8cb89eb4911e25a0c469b6b4e/pycryptodome-3.23.0-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:67bd81fcbe34f43ad9422ee8fd4843c8e7198dd88dd3d40e6de42ee65fbe1490", size = 2182625, upload-time = "2025-05-17T17:20:52.866Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/5f/e9/a09476d436d0ff1402ac3867d933c61805ec2326c6ea557aeeac3825604e/pycryptodome-3.23.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c8987bd3307a39bc03df5c8e0e3d8be0c4c3518b7f044b0f4c15d1aa78f52575", size = 2268954, upload-time = "2025-05-17T17:20:55.027Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/f9/c5/ffe6474e0c551d54cab931918127c46d70cab8f114e0c2b5a3c071c2f484/pycryptodome-3.23.0-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:aa0698f65e5b570426fc31b8162ed4603b0c2841cbb9088e2b01641e3065915b", size = 2308534, upload-time = "2025-05-17T17:20:57.279Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/18/28/e199677fc15ecf43010f2463fde4c1a53015d1fe95fb03bca2890836603a/pycryptodome-3.23.0-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:53ecbafc2b55353edcebd64bf5da94a2a2cdf5090a6915bcca6eca6cc452585a", size = 2181853, upload-time = "2025-05-17T17:20:59.322Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/ce/ea/4fdb09f2165ce1365c9eaefef36625583371ee514db58dc9b65d3a255c4c/pycryptodome-3.23.0-cp37-abi3-musllinux_1_2_i686.whl", hash = "sha256:156df9667ad9f2ad26255926524e1c136d6664b741547deb0a86a9acf5ea631f", size = 2342465, upload-time = "2025-05-17T17:21:03.83Z" },
{ url = "https://files.pythonhosted.org/packages/22/82/6edc3fc42fe9284aead511394bac167693fb2b0e0395b28b8bedaa07ef04/pycryptodome-3.23.0-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:dea827b4d55ee390dc89b2afe5927d4308a8b538ae91d9c6f7a5090f397af1aa", size = 2267414, upload-time = "2025-05-17T17:21:06.72Z" },
{ url = "https://files.pythonhosted.org/packages/59/fe/aae679b64363eb78326c7fdc9d06ec3de18bac68be4b612fc1fe8902693c/pycryptodome-3.23.0-cp37-abi3-win32.whl", hash = "sha256:507dbead45474b62b2bbe318eb1c4c8ee641077532067fec9c1aa82c31f84886", size = 1768484, upload-time = "2025-05-17T17:21:08.535Z" },
{ url = "https://files.pythonhosted.org/packages/54/2f/e97a1b8294db0daaa87012c24a7bb714147c7ade7656973fd6c736b484ff/pycryptodome-3.23.0-cp37-abi3-win_amd64.whl", hash = "sha256:c75b52aacc6c0c260f204cbdd834f76edc9fb0d8e0da9fbf8352ef58202564e2", size = 1799636, upload-time = "2025-05-17T17:21:10.393Z" },
{ url = "https://files.pythonhosted.org/packages/18/3d/f9441a0d798bf2b1e645adc3265e55706aead1255ccdad3856dbdcffec14/pycryptodome-3.23.0-cp37-abi3-win_arm64.whl", hash = "sha256:11eeeb6917903876f134b56ba11abe95c0b0fd5e3330def218083c7d98bbcb3c", size = 1703675, upload-time = "2025-05-17T17:21:13.146Z" },
]
[[package]]
name = "pydantic"
version = "2.12.5"
@@ -2897,6 +2961,19 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/d9/4f/00be2196329ebbff56ce564aa94efb0fbc828d00de250b1980de1a34ab49/python_pptx-1.0.2-py3-none-any.whl", hash = "sha256:160838e0b8565a8b1f67947675886e9fea18aa5e795db7ae531606d68e785cba", size = 472788, upload-time = "2024-08-07T17:33:28.192Z" },
]
[[package]]
name = "python-telegram-bot"
version = "22.6"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "httpcore", marker = "python_full_version >= '3.14'" },
{ name = "httpx" },
]
sdist = { url = "https://files.pythonhosted.org/packages/cd/9b/8df90c85404166a6631e857027866263adb27440d8af1dbeffbdc4f0166c/python_telegram_bot-22.6.tar.gz", hash = "sha256:50ae8cc10f8dff01445628687951020721f37956966b92a91df4c1bf2d113742", size = 1503761, upload-time = "2026-01-24T13:57:00.269Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/13/97/7298f0e1afe3a1ae52ff4c5af5087ed4de319ea73eb3b5c8c4dd4e76e708/python_telegram_bot-22.6-py3-none-any.whl", hash = "sha256:e598fe171c3dde2dfd0f001619ee9110eece66761a677b34719fb18934935ce0", size = 737267, upload-time = "2026-01-24T13:56:58.06Z" },
]
[[package]]
name = "pytz"
version = "2026.1.post1"
@@ -3252,6 +3329,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/b7/ce/149a00dd41f10bc29e5921b496af8b574d8413afcd5e30dfa0ed46c2cc5e/six-1.17.0-py2.py3-none-any.whl", hash = "sha256:4721f391ed90541fddacab5acf947aa0d3dc7d27b2e1e8eda2be8970586c3274", size = 11050, upload-time = "2024-12-04T17:35:26.475Z" },
]
[[package]]
name = "slack-sdk"
version = "3.40.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/3a/18/784859b33a3f9c8cdaa1eda4115eb9fe72a0a37304718887d12991eeb2fd/slack_sdk-3.40.1.tar.gz", hash = "sha256:a215333bc251bc90abf5f5110899497bf61a3b5184b6d9ee35d73ebf09ec3fd0", size = 250379, upload-time = "2026-02-18T22:11:01.819Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/6e/e1/bb81f93c9f403e3b573c429dd4838ec9b44e4ef35f3b0759eb49557ab6e3/slack_sdk-3.40.1-py2.py3-none-any.whl", hash = "sha256:cd8902252979aa248092b0d77f3a9ea3cc605bc5d53663ad728e892e26e14a65", size = 313687, upload-time = "2026-02-18T22:11:00.027Z" },
]
[[package]]
name = "sniffio"
version = "1.3.1"
@@ -378,3 +378,31 @@ memory:
# checkpointer:
# type: postgres
# connection_string: postgresql://user:password@localhost:5432/deerflow
# ============================================================================
# IM Channels Configuration
# ============================================================================
# Connect DeerFlow to external messaging platforms.
# All channels use outbound connections (WebSocket or polling) — no public IP required.
# channels:
# # LangGraph Server URL for thread/message management (default: http://localhost:2024)
# langgraph_url: http://localhost:2024
# # Gateway API URL for auxiliary queries like /models, /memory (default: http://localhost:8001)
# gateway_url: http://localhost:8001
#
# feishu:
# enabled: false
# app_id: $FEISHU_APP_ID
# app_secret: $FEISHU_APP_SECRET
#
# slack:
# enabled: false
# bot_token: $SLACK_BOT_TOKEN # xoxb-...
# app_token: $SLACK_APP_TOKEN # xapp-... (Socket Mode)
# allowed_users: [] # empty = allow all
#
# telegram:
# enabled: false
# bot_token: $TELEGRAM_BOT_TOKEN
# allowed_users: [] # empty = allow all
@@ -1,356 +1,485 @@
|
||||
---
|
||||
name: skill-creator
|
||||
description: Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.
|
||||
license: Complete terms in LICENSE.txt
|
||||
description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
|
||||
---
|
||||
|
||||
# Skill Creator
|
||||
|
||||
This skill provides guidance for creating effective skills.
|
||||
A skill for creating new skills and iteratively improving them.
|
||||
|
||||
## About Skills
|
||||
At a high level, the process of creating a skill goes like this:
|
||||
|
||||
Skills are modular, self-contained packages that extend Claude's capabilities by providing
|
||||
specialized knowledge, workflows, and tools. Think of them as "onboarding guides" for specific
|
||||
domains or tasks—they transform Claude from a general-purpose agent into a specialized agent
|
||||
equipped with procedural knowledge that no model can fully possess.
|
||||
- Decide what you want the skill to do and roughly how it should do it
|
||||
- Write a draft of the skill
|
||||
- Create a few test prompts and run claude-with-access-to-the-skill on them
|
||||
- Help the user evaluate the results both qualitatively and quantitatively
|
||||
- While the runs happen in the background, draft some quantitative evals if there aren't any (if there are some, you can either use as is or modify if you feel something needs to change about them). Then explain them to the user (or if they already existed, explain the ones that already exist)
|
||||
- Use the `eval-viewer/generate_review.py` script to show the user the results for them to look at, and also let them look at the quantitative metrics
|
||||
- Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
|
||||
- Repeat until you're satisfied
|
||||
- Expand the test set and try again at larger scale
|
||||
|
||||
### What Skills Provide
|
||||
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
|
||||
|
||||
1. Specialized workflows - Multi-step procedures for specific domains
|
||||
2. Tool integrations - Instructions for working with specific file formats or APIs
|
||||
3. Domain expertise - Company-specific knowledge, schemas, business logic
|
||||
4. Bundled resources - Scripts, references, and assets for complex and repetitive tasks
|
||||
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
|
||||
|
||||
## Core Principles
|
||||
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
|
||||
|
||||
### Concise is Key
|
||||
Then after the skill is done (but again, the order is flexible), you can also run the skill description improver, which we have a whole separate script for, to optimize the triggering of the skill.
|
||||
|
||||
The context window is a public good. Skills share the context window with everything else Claude needs: system prompt, conversation history, other Skills' metadata, and the actual user request.
|
||||
Cool? Cool.
|
||||
|
||||
**Default assumption: Claude is already very smart.** Only add context Claude doesn't already have. Challenge each piece of information: "Does Claude really need this explanation?" and "Does this paragraph justify its token cost?"
|
||||
## Communicating with the user
|
||||
|
||||
Prefer concise examples over verbose explanations.
|
||||
The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you, it's only very recently that it started), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
|
||||
|
||||
### Set Appropriate Degrees of Freedom
|
||||
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
|
||||
|
||||
Match the level of specificity to the task's fragility and variability:
|
||||
- "evaluation" and "benchmark" are borderline, but OK
|
||||
- for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
|
||||
|
||||
**High freedom (text-based instructions)**: Use when multiple approaches are valid, decisions depend on context, or heuristics guide the approach.
|
||||
It's OK to briefly explain terms if you're in doubt, and feel free to clarify terms with a short definition if you're unsure if the user will get it.
|
||||
|
||||
**Medium freedom (pseudocode or scripts with parameters)**: Use when a preferred pattern exists, some variation is acceptable, or configuration affects behavior.
|
||||
---
|
||||
|
||||
**Low freedom (specific scripts, few parameters)**: Use when operations are fragile and error-prone, consistency is critical, or a specific sequence must be followed.
|
||||
## Creating a skill
|
||||
|
||||
Think of Claude as exploring a path: a narrow bridge with cliffs needs specific guardrails (low freedom), while an open field allows many routes (high freedom).
|
||||
### Capture Intent
|
||||
|
||||
### Anatomy of a Skill
|
||||
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. The user may need to fill the gaps, and should confirm before proceeding to the next step.
|
||||
|
||||
Every skill consists of a required SKILL.md file and optional bundled resources:
|
||||
1. What should this skill enable Claude to do?
|
||||
2. When should this skill trigger? (what user phrases/contexts)
|
||||
3. What's the expected output format?
|
||||
4. Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.
|
||||
|
||||
### Interview and Research
|
||||
|
||||
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
|
||||
|
||||
Check available MCPs - if useful for research (searching docs, finding similar skills, looking up best practices), research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce burden on the user.
|
||||
|
||||
### Write the SKILL.md
|
||||
|
||||
Based on the user interview, fill in these components:
|
||||
|
||||
- **name**: Skill identifier
|
||||
- **description**: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Note: currently Claude has a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
|
||||
- **compatibility**: Required tools, dependencies (optional, rarely needed)
|
||||
- **the rest of the skill :)**
|
||||
|
||||
### Skill Writing Guide
|
||||
|
||||
#### Anatomy of a Skill
|
||||
|
||||
```
|
||||
skill-name/
|
||||
├── SKILL.md (required)
|
||||
│ ├── YAML frontmatter metadata (required)
|
||||
│ │ ├── name: (required)
|
||||
│ │ └── description: (required)
|
||||
│ └── Markdown instructions (required)
|
||||
│ ├── YAML frontmatter (name, description required)
|
||||
│ └── Markdown instructions
|
||||
└── Bundled Resources (optional)
|
||||
├── scripts/ - Executable code (Python/Bash/etc.)
|
||||
├── references/ - Documentation intended to be loaded into context as needed
|
||||
└── assets/ - Files used in output (templates, icons, fonts, etc.)
|
||||
├── scripts/ - Executable code for deterministic/repetitive tasks
|
||||
├── references/ - Docs loaded into context as needed
|
||||
└── assets/ - Files used in output (templates, icons, fonts)
|
||||
```
|
||||
|
||||
#### SKILL.md (required)
|
||||
#### Progressive Disclosure
|
||||
|
||||
Every SKILL.md consists of:
|
||||
Skills use a three-level loading system:
|
||||
1. **Metadata** (name + description) - Always in context (~100 words)
|
||||
2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal)
|
||||
3. **Bundled resources** - As needed (unlimited, scripts can execute without loading)
|
||||
|
||||
- **Frontmatter** (YAML): Contains `name` and `description` fields. These are the only fields that Claude reads to determine when the skill gets used, thus it is very important to be clear and comprehensive in describing what the skill is, and when it should be used.
|
||||
- **Body** (Markdown): Instructions and guidance for using the skill. Only loaded AFTER the skill triggers (if at all).
|
||||
These word counts are approximate; feel free to go longer if needed.
|
||||
|
||||
#### Bundled Resources (optional)
|
||||
|
||||
##### Scripts (`scripts/`)
|
||||
|
||||
Executable code (Python/Bash/etc.) for tasks that require deterministic reliability or are repeatedly rewritten.
|
||||
|
||||
- **When to include**: When the same code is being rewritten repeatedly or deterministic reliability is needed
|
||||
- **Example**: `scripts/rotate_pdf.py` for PDF rotation tasks
|
||||
- **Benefits**: Token efficient, deterministic, may be executed without loading into context
|
||||
- **Note**: Scripts may still need to be read by Claude for patching or environment-specific adjustments
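
As a rough illustration of what such a bundled script might look like — a minimal sketch assuming the pypdf library, with the file paths and default angle chosen purely for the example — a `scripts/rotate_pdf.py` could be as small as:

```python
#!/usr/bin/env python3
"""Rotate every page of a PDF by a fixed angle (illustrative sketch using pypdf)."""
import argparse

from pypdf import PdfReader, PdfWriter


def rotate_pdf(src: str, dst: str, angle: int) -> None:
    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        page.rotate(angle)  # pypdf rotates clockwise, in multiples of 90 degrees
        writer.add_page(page)
    with open(dst, "wb") as f:
        writer.write(f)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Rotate all pages of a PDF")
    parser.add_argument("input")
    parser.add_argument("output")
    parser.add_argument("--angle", type=int, default=90)
    args = parser.parse_args()
    rotate_pdf(args.input, args.output, args.angle)
```

Bundling something like this means each future invocation runs one command instead of rewriting the same boilerplate.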
|
||||
|
||||
##### References (`references/`)
|
||||
|
||||
Documentation and reference material intended to be loaded as needed into context to inform Claude's process and thinking.
|
||||
|
||||
- **When to include**: For documentation that Claude should reference while working
|
||||
- **Examples**: `references/finance.md` for financial schemas, `references/mnda.md` for company NDA template, `references/policies.md` for company policies, `references/api_docs.md` for API specifications
|
||||
- **Use cases**: Database schemas, API documentation, domain knowledge, company policies, detailed workflow guides
|
||||
- **Benefits**: Keeps SKILL.md lean, loaded only when Claude determines it's needed
|
||||
- **Best practice**: If files are large (>10k words), include grep search patterns in SKILL.md
|
||||
- **Avoid duplication**: Information should live in either SKILL.md or references files, not both. Prefer references files for detailed information unless it's truly core to the skill—this keeps SKILL.md lean while making information discoverable without hogging the context window. Keep only essential procedural instructions and workflow guidance in SKILL.md; move detailed reference material, schemas, and examples to references files.
|
||||
|
||||
##### Assets (`assets/`)
|
||||
|
||||
Files not intended to be loaded into context, but rather used within the output Claude produces.
|
||||
|
||||
- **When to include**: When the skill needs files that will be used in the final output
|
||||
- **Examples**: `assets/logo.png` for brand assets, `assets/slides.pptx` for PowerPoint templates, `assets/frontend-template/` for HTML/React boilerplate, `assets/font.ttf` for typography
|
||||
- **Use cases**: Templates, images, icons, boilerplate code, fonts, sample documents that get copied or modified
|
||||
- **Benefits**: Separates output resources from documentation, enables Claude to use files without loading them into context
|
||||
|
||||
#### What to Not Include in a Skill
|
||||
|
||||
A skill should only contain essential files that directly support its functionality. Do NOT create extraneous documentation or auxiliary files, including:
|
||||
|
||||
- README.md
|
||||
- INSTALLATION_GUIDE.md
|
||||
- QUICK_REFERENCE.md
|
||||
- CHANGELOG.md
|
||||
- etc.
|
||||
|
||||
The skill should only contain the information needed for an AI agent to do the job at hand. It should not contain auxiliary context about the process that went into creating it, setup and testing procedures, user-facing documentation, etc. Creating additional documentation files just adds clutter and confusion.
|
||||
|
||||
### Progressive Disclosure Design Principle
|
||||
|
||||
Skills use a three-level loading system to manage context efficiently:
|
||||
|
||||
1. **Metadata (name + description)** - Always in context (~100 words)
|
||||
2. **SKILL.md body** - When skill triggers (<5k words)
|
||||
3. **Bundled resources** - As needed by Claude (Unlimited because scripts can be executed without reading into context window)
|
||||
|
||||
#### Progressive Disclosure Patterns
|
||||
|
||||
Keep SKILL.md body to the essentials and under 500 lines to minimize context bloat. Split content into separate files when approaching this limit. When splitting out content into other files, it is very important to reference them from SKILL.md and describe clearly when to read them, to ensure the reader of the skill knows they exist and when to use them.
|
||||
|
||||
**Key principle:** When a skill supports multiple variations, frameworks, or options, keep only the core workflow and selection guidance in SKILL.md. Move variant-specific details (patterns, examples, configuration) into separate reference files.
|
||||
|
||||
**Pattern 1: High-level guide with references**
|
||||
|
||||
```markdown
|
||||
# PDF Processing
|
||||
|
||||
## Quick start
|
||||
|
||||
Extract text with pdfplumber:
|
||||
[code example]
|
||||
|
||||
## Advanced features
|
||||
|
||||
- **Form filling**: See [FORMS.md](FORMS.md) for complete guide
|
||||
- **API reference**: See [REFERENCE.md](REFERENCE.md) for all methods
|
||||
- **Examples**: See [EXAMPLES.md](EXAMPLES.md) for common patterns
|
||||
```
|
||||
|
||||
Claude loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed.
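
The `[code example]` placeholder above might be filled with something like this — a minimal sketch assuming pdfplumber is available, with the filename purely illustrative:

```python
import pdfplumber

# Extract text from every page of a PDF; extract_text() can return None for empty pages
with pdfplumber.open("report.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)
print(text)
```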
|
||||
|
||||
**Pattern 2: Domain-specific organization**
|
||||
|
||||
For Skills with multiple domains, organize content by domain to avoid loading irrelevant context:
|
||||
|
||||
```
|
||||
bigquery-skill/
|
||||
├── SKILL.md (overview and navigation)
|
||||
└── reference/
|
||||
├── finance.md (revenue, billing metrics)
|
||||
├── sales.md (opportunities, pipeline)
|
||||
├── product.md (API usage, features)
|
||||
└── marketing.md (campaigns, attribution)
|
||||
```
|
||||
|
||||
When a user asks about sales metrics, Claude only reads sales.md.
|
||||
|
||||
Similarly, for skills supporting multiple frameworks or variants, organize by variant:
|
||||
**Key patterns:**
|
||||
- Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
|
||||
- Reference files clearly from SKILL.md with guidance on when to read them
|
||||
- For large reference files (>300 lines), include a table of contents
|
||||
|
||||
**Domain organization**: When a skill supports multiple domains/frameworks, organize by variant:
|
||||
```
|
||||
cloud-deploy/
|
||||
├── SKILL.md (workflow + provider selection)
|
||||
├── SKILL.md (workflow + selection)
|
||||
└── references/
|
||||
├── aws.md (AWS deployment patterns)
|
||||
├── gcp.md (GCP deployment patterns)
|
||||
└── azure.md (Azure deployment patterns)
|
||||
├── aws.md
|
||||
├── gcp.md
|
||||
└── azure.md
|
||||
```
|
||||
Claude reads only the relevant reference file.
|
||||
|
||||
When the user chooses AWS, Claude only reads aws.md.
|
||||
#### Principle of Lack of Surprise
|
||||
|
||||
**Pattern 3: Conditional details**
|
||||
This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. If a skill's contents were described to the user, its intent should not come as a surprise. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like a "roleplay as an XYZ" are OK though.
|
||||
|
||||
Show basic content, link to advanced content:
|
||||
#### Writing Patterns
|
||||
|
||||
Prefer using the imperative form in instructions.
|
||||
|
||||
**Defining output formats** - You can do it like this:
|
||||
```markdown
|
||||
# DOCX Processing
|
||||
|
||||
## Creating documents
|
||||
|
||||
Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md).
|
||||
|
||||
## Editing documents
|
||||
|
||||
For simple edits, modify the XML directly.
|
||||
|
||||
**For tracked changes**: See [REDLINING.md](REDLINING.md)
|
||||
**For OOXML details**: See [OOXML.md](OOXML.md)
|
||||
## Report structure
|
||||
ALWAYS use this exact template:
|
||||
# [Title]
|
||||
## Executive summary
|
||||
## Key findings
|
||||
## Recommendations
|
||||
```
|
||||
|
||||
Claude reads REDLINING.md or OOXML.md only when the user needs those features.
|
||||
**Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):
|
||||
```markdown
|
||||
## Commit message format
|
||||
**Example 1:**
|
||||
Input: Added user authentication with JWT tokens
|
||||
Output: feat(auth): implement JWT-based authentication
|
||||
```
|
||||
|
||||
**Important guidelines:**
|
||||
### Writing Style
|
||||
|
||||
- **Avoid deeply nested references** - Keep references one level deep from SKILL.md. All reference files should link directly from SKILL.md.
|
||||
- **Structure longer reference files** - For files longer than 100 lines, include a table of contents at the top so Claude can see the full scope when previewing.
|
||||
Try to explain to the model why things are important in lieu of heavy-handed musty MUSTs. Use theory of mind and try to make the skill general and not super-narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.
|
||||
|
||||
## Skill Creation Process
|
||||
### Test Cases
|
||||
|
||||
Skill creation involves these steps:
|
||||
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
|
||||
|
||||
1. Understand the skill with concrete examples
|
||||
2. Plan reusable skill contents (scripts, references, assets)
|
||||
3. Initialize the skill (run init_skill.py)
|
||||
4. Edit the skill (implement resources and write SKILL.md)
|
||||
5. Package the skill (run package_skill.py)
|
||||
6. Iterate based on real usage
|
||||
Save test cases to `evals/evals.json`. Don't write assertions yet — just the prompts. You'll draft assertions in the next step while the runs are in progress.
|
||||
|
||||
Follow these steps in order, skipping only if there is a clear reason why they are not applicable.
|
||||
```json
|
||||
{
|
||||
"skill_name": "example-skill",
|
||||
"evals": [
|
||||
{
|
||||
"id": 1,
|
||||
"prompt": "User's task prompt",
|
||||
"expected_output": "Description of expected result",
|
||||
"files": []
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Step 1: Understanding the Skill with Concrete Examples
|
||||
See `references/schemas.md` for the full schema (including the `assertions` field, which you'll add later).
|
||||
|
||||
Skip this step only when the skill's usage patterns are already clearly understood. It remains valuable even when working with an existing skill.
|
||||
## Running and evaluating test cases
|
||||
|
||||
To create an effective skill, clearly understand concrete examples of how the skill will be used. This understanding can come from either direct user examples or generated examples that are validated with user feedback.
|
||||
This section is one continuous sequence — don't stop partway through. Do NOT use `/skill-test` or any other testing skill.
|
||||
|
||||
For example, when building an image-editor skill, relevant questions include:
|
||||
Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Within the workspace, organize results by iteration (`iteration-1/`, `iteration-2/`, etc.) and within that, each test case gets a directory (`eval-0/`, `eval-1/`, etc.). Don't create all of this upfront — just create directories as you go.
|
||||
|
||||
- "What functionality should the image-editor skill support? Editing, rotating, anything else?"
|
||||
- "Can you give some examples of how this skill would be used?"
|
||||
- "I can imagine users asking for things like 'Remove the red-eye from this image' or 'Rotate this image'. Are there other ways you imagine this skill being used?"
|
||||
- "What would a user say that should trigger this skill?"
|
||||
### Step 1: Spawn all runs (with-skill AND baseline) in the same turn
|
||||
|
||||
To avoid overwhelming users, avoid asking too many questions in a single message. Start with the most important questions and follow up as needed for better effectiveness.
|
||||
For each test case, spawn two subagents in the same turn — one with the skill, one without. This is important: don't spawn the with-skill runs first and then come back for baselines later. Launch everything at once so it all finishes around the same time.
|
||||
|
||||
Conclude this step when there is a clear sense of the functionality the skill should support.
|
||||
**With-skill run:**
|
||||
|
||||
### Step 2: Planning the Reusable Skill Contents
|
||||
```
|
||||
Execute this task:
|
||||
- Skill path: <path-to-skill>
|
||||
- Task: <eval prompt>
|
||||
- Input files: <eval files if any, or "none">
|
||||
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
|
||||
- Outputs to save: <what the user cares about — e.g., "the .docx file", "the final CSV">
|
||||
```
|
||||
|
||||
To turn concrete examples into an effective skill, analyze each example by:
|
||||
**Baseline run** (same prompt, but the baseline depends on context):
|
||||
- **Creating a new skill**: no skill at all. Same prompt, no skill path, save to `without_skill/outputs/`.
|
||||
- **Improving an existing skill**: the old version. Before editing, snapshot the skill (`cp -r <skill-path> <workspace>/skill-snapshot/`), then point the baseline subagent at the snapshot. Save to `old_skill/outputs/`.
|
||||
|
||||
1. Considering how to execute on the example from scratch
|
||||
2. Identifying what scripts, references, and assets would be helpful when executing these workflows repeatedly
|
||||
Write an `eval_metadata.json` for each test case (assertions can be empty for now). Give each eval a descriptive name based on what it's testing — not just "eval-0". Use this name for the directory too. If this iteration uses new or modified eval prompts, create these files for each new eval directory — don't assume they carry over from previous iterations.
|
||||
|
||||
Example: When building a `pdf-editor` skill to handle queries like "Help me rotate this PDF," the analysis shows:
|
||||
```json
|
||||
{
|
||||
"eval_id": 0,
|
||||
"eval_name": "descriptive-name-here",
|
||||
"prompt": "The user's task prompt",
|
||||
"assertions": []
|
||||
}
|
||||
```
|
||||
|
||||
1. Rotating a PDF requires re-writing the same code each time
|
||||
2. A `scripts/rotate_pdf.py` script would be helpful to store in the skill
|
||||
### Step 2: While runs are in progress, draft assertions
|
||||
|
||||
Example: When designing a `frontend-webapp-builder` skill for queries like "Build me a todo app" or "Build me a dashboard to track my steps," the analysis shows:
|
||||
Don't just wait for the runs to finish — you can use this time productively. Draft quantitative assertions for each test case and explain them to the user. If assertions already exist in `evals/evals.json`, review them and explain what they check.
|
||||
|
||||
1. Writing a frontend webapp requires the same boilerplate HTML/React each time
|
||||
2. An `assets/hello-world/` template containing the boilerplate HTML/React project files would be helpful to store in the skill
|
||||
Good assertions are objectively verifiable and have descriptive names — they should read clearly in the benchmark viewer so someone glancing at the results immediately understands what each one checks. Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.
|
||||
|
||||
Example: When building a `big-query` skill to handle queries like "How many users have logged in today?" the analysis shows:
|
||||
Update the `eval_metadata.json` files and `evals/evals.json` with the assertions once drafted. Also explain to the user what they'll see in the viewer — both the qualitative outputs and the quantitative benchmark.
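
As a rough illustration (the eval name, prompt, and assertion wording are hypothetical, and the exact assertion structure is defined in `references/schemas.md` — this sketch assumes simple strings), a filled-in eval might look like:

```json
{
  "eval_id": 0,
  "eval_name": "quarterly-report-docx",
  "prompt": "The user's task prompt",
  "assertions": [
    "Output is a valid .docx file that opens without errors",
    "Document contains an 'Executive summary' heading",
    "All revenue figures from the input CSV appear in the summary table"
  ]
}
```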
|
||||
|
||||
1. Querying BigQuery requires re-discovering the table schemas and relationships each time
|
||||
2. A `references/schema.md` file documenting the table schemas would be helpful to store in the skill
|
||||
### Step 3: As runs complete, capture timing data
|
||||
|
||||
To establish the skill's contents, analyze each concrete example to create a list of the reusable resources to include: scripts, references, and assets.
|
||||
When each subagent task completes, you receive a notification containing `total_tokens` and `duration_ms`. Save this data immediately to `timing.json` in the run directory:
|
||||
|
||||
### Step 3: Initializing the Skill
|
||||
```json
|
||||
{
|
||||
"total_tokens": 84852,
|
||||
"duration_ms": 23332,
|
||||
"total_duration_seconds": 23.3
|
||||
}
|
||||
```
|
||||
|
||||
At this point, it is time to actually create the skill.
|
||||
This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.
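
A minimal sketch of that capture step, assuming the notification fields map directly onto the keys shown above:

```python
import json
from pathlib import Path


def save_timing(run_dir: str, total_tokens: int, duration_ms: int) -> None:
    """Persist the notification's timing data before it is lost."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    Path(run_dir, "timing.json").write_text(json.dumps(timing, indent=2))
```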
|
||||
|
||||
Skip this step only if the skill being developed already exists, and iteration or packaging is needed. In this case, continue to the next step.
|
||||
### Step 4: Grade, aggregate, and launch the viewer
|
||||
|
||||
When creating a new skill from scratch, always run the `init_skill.py` script. The script conveniently generates a new template skill directory that automatically includes everything a skill requires, making the skill creation process much more efficient and reliable.
|
||||
Once all runs are done:
|
||||
|
||||
Usage:
|
||||
1. **Grade each run** — spawn a grader subagent (or grade inline) that reads `agents/grader.md` and evaluates each assertion against the outputs. Save results to `grading.json` in each run directory. The grading.json expectations array must use the fields `text`, `passed`, and `evidence` (not `name`/`met`/`details` or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.
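
   For instance, the `expectations` array in a grading.json might look roughly like this (the values are illustrative; only the field names are fixed, and any other top-level fields come from `references/schemas.md`):

   ```json
   {
     "expectations": [
       {
         "text": "Output is a valid .docx file that opens without errors",
         "passed": true,
         "evidence": "python-docx opened the file and found 4 paragraphs"
       }
     ]
   }
   ```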
|
||||
|
||||
2. **Aggregate into benchmark** — run the aggregation script from the skill-creator directory:
|
||||
```bash
|
||||
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
|
||||
```
|
||||
This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see `references/schemas.md` for the exact schema the viewer expects.
|
||||
Put each with_skill version before its baseline counterpart.
|
||||
|
||||
3. **Do an analyst pass** — read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.
|
||||
|
||||
4. **Launch the viewer** with both qualitative outputs and quantitative data:
|
||||
```bash
|
||||
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
|
||||
<workspace>/iteration-N \
|
||||
--skill-name "my-skill" \
|
||||
--benchmark <workspace>/iteration-N/benchmark.json \
|
||||
> /dev/null 2>&1 &
|
||||
VIEWER_PID=$!
|
||||
```
|
||||
For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
|
||||
|
||||
**Cowork / headless environments:** If `webbrowser.open()` is not available or the environment has no display, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a `feedback.json` file when the user clicks "Submit All Reviews". After download, copy `feedback.json` into the workspace directory for the next iteration to pick up.
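
A sketch of the headless invocation, mirroring the flags above (the output filename is a placeholder — any path works):

```bash
python <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  --static <workspace>/iteration-N/review.html
```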
|
||||
|
||||
Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.
|
||||
|
||||
5. **Tell the user** something like: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
|
||||
|
||||
### What the user sees in the viewer
|
||||
|
||||
The "Outputs" tab shows one test case at a time:
|
||||
- **Prompt**: the task that was given
|
||||
- **Output**: the files the skill produced, rendered inline where possible
|
||||
- **Previous Output** (iteration 2+): collapsed section showing last iteration's output
|
||||
- **Formal Grades** (if grading was run): collapsed section showing assertion pass/fail
|
||||
- **Feedback**: a textbox that auto-saves as they type
|
||||
- **Previous Feedback** (iteration 2+): their comments from last time, shown below the textbox
|
||||
|
||||
The "Benchmark" tab shows the stats summary: pass rates, timing, and token usage for each configuration, with per-eval breakdowns and analyst observations.
|
||||
|
||||
Navigation is via prev/next buttons or arrow keys. When done, they click "Submit All Reviews" which saves all feedback to `feedback.json`.
|
||||
|
||||
### Step 5: Read the feedback
|
||||
|
||||
When the user tells you they're done, read `feedback.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"reviews": [
|
||||
{"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
|
||||
{"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."},
|
||||
{"run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..."}
|
||||
],
|
||||
"status": "complete"
|
||||
}
|
||||
```
|
||||
|
||||
Empty feedback means the user thought it was fine. Focus your improvements on the test cases where the user had specific complaints.
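
A minimal sketch for pulling out the runs that actually need attention, assuming the structure shown above:

```python
import json

with open("feedback.json") as f:
    feedback = json.load(f)

# Only runs with non-empty feedback need changes
complaints = [r for r in feedback["reviews"] if r["feedback"].strip()]
for review in complaints:
    print(f"{review['run_id']}: {review['feedback']}")
```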
|
||||
|
||||
Kill the viewer server when you're done with it:
|
||||
|
||||
```bash
|
||||
scripts/init_skill.py <skill-name> --path <output-directory>
|
||||
kill $VIEWER_PID 2>/dev/null
|
||||
```
|
||||
|
||||
The script:
|
||||
---
|
||||
|
||||
- Creates the skill directory at the specified path
|
||||
- Generates a SKILL.md template with proper frontmatter and TODO placeholders
|
||||
- Creates example resource directories: `scripts/`, `references/`, and `assets/`
|
||||
- Adds example files in each directory that can be customized or deleted
|
||||
## Improving the skill
|
||||
|
||||
After initialization, customize or remove the generated SKILL.md and example files as needed.
|
||||
This is the heart of the loop. You've run the test cases, the user has reviewed the results, and now you need to make the skill better based on their feedback.
|
||||
|
||||
### Step 4: Edit the Skill
|
||||
### How to think about improvements
|
||||
|
||||
When editing the (newly-generated or existing) skill, remember that the skill is being created for another instance of Claude to use. Include information that would be beneficial and non-obvious to Claude. Consider what procedural knowledge, domain-specific details, or reusable assets would help another Claude instance execute these tasks more effectively.
|
||||
1. **Generalize from the feedback.** The big picture thing that's happening here is that we're trying to create skills that can be used a million times (maybe literally, maybe even more who knows) across many different prompts. Here you and the user are iterating on only a few examples over and over again because it helps move faster. The user knows these examples in and out and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than put in fiddly overfitty changes, or oppressively constrictive MUSTs, if there's some stubborn issue, you might try branching out and using different metaphors, or recommending different patterns of working. It's relatively cheap to try and maybe you'll land on something great.
|
||||
|
||||
#### Learn Proven Design Patterns
|
||||
2. **Keep the prompt lean.** Remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs — if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.
|
||||
|
||||
Consult these helpful guides based on your skill's needs:
|
||||
3. **Explain the why.** Try hard to explain the **why** behind everything you're asking the model to do. Today's LLMs are *smart*. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task and why the user is writing what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
|
||||
|
||||
- **Multi-step processes**: See references/workflows.md for sequential workflows and conditional logic
|
||||
- **Specific output formats or quality standards**: See references/output-patterns.md for template and example patterns
|
||||
4. **Look for repeated work across test cases.** Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a `create_docx.py` or a `build_chart.py`, that's a strong signal the skill should bundle that script. Write it once, put it in `scripts/`, and tell the skill to use it. This saves every future invocation from reinventing the wheel.
|
||||
|
||||
These files contain established best practices for effective skill design.
|
||||
This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft revision and then looking at it anew and making improvements. Really do your best to get into the head of the user and understand what they want and need.
|
||||
|
||||
#### Start with Reusable Skill Contents
|
||||
### The iteration loop
|
||||
|
||||
To begin implementation, start with the reusable resources identified above: `scripts/`, `references/`, and `assets/` files. Note that this step may require user input. For example, when implementing a `brand-guidelines` skill, the user may need to provide brand assets or templates to store in `assets/`, or documentation to store in `references/`.
|
||||
After improving the skill:
|
||||
|
||||
Added scripts must be tested by actually running them to ensure there are no bugs and that the output matches what is expected. If there are many similar scripts, only a representative sample needs to be tested to ensure confidence that they all work while balancing time to completion.
|
||||
1. Apply your improvements to the skill
|
||||
2. Rerun all test cases into a new `iteration-<N+1>/` directory, including baseline runs. If you're creating a new skill, the baseline is always `without_skill` (no skill) — that stays the same across iterations. If you're improving an existing skill, use your judgment on what makes sense as the baseline: the original version the user came in with, or the previous iteration.
|
||||
3. Launch the reviewer with `--previous-workspace` pointing at the previous iteration
|
||||
4. Wait for the user to review and tell you they're done
|
||||
5. Read the new feedback, improve again, repeat
|
||||
|
||||
Any example files and directories not needed for the skill should be deleted. The initialization script creates example files in `scripts/`, `references/`, and `assets/` to demonstrate structure, but most skills won't need all of them.
|
||||
Keep going until:
|
||||
- The user says they're happy
|
||||
- The feedback is all empty (everything looks good)
|
||||
- You're not making meaningful progress
|
||||
|
||||
#### Update SKILL.md
|
||||
---
|
||||
|
||||
**Writing Guidelines:** Always use imperative/infinitive form.
|
||||
## Advanced: Blind comparison
|
||||
|
||||
##### Frontmatter
|
||||
For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read `agents/comparator.md` and `agents/analyzer.md` for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.
|
||||
|
||||
Write the YAML frontmatter with `name` and `description`:
|
||||
This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.
|
||||
|
||||
- `name`: The skill name
|
||||
- `description`: This is the primary triggering mechanism for your skill, and helps Claude understand when to use the skill.
|
||||
- Include both what the Skill does and specific triggers/contexts for when to use it.
|
||||
- Include all "when to use" information here - Not in the body. The body is only loaded after triggering, so "When to Use This Skill" sections in the body are not helpful to Claude.
|
||||
- Example description for a `docx` skill: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use when Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"
|
||||
---
|
||||
|
||||
Do not include any other fields in YAML frontmatter.
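
Putting that together, the frontmatter for the docx example above would look something like this (description abridged from the example):

```yaml
---
name: docx
description: Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use when Claude needs to work with professional documents (.docx files) for creating new documents, modifying or editing content, working with tracked changes, adding comments, or any other document tasks.
---
```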
|
||||
## Description Optimization
|
||||
|
||||
##### Body
|
||||
The description field in SKILL.md frontmatter is the primary mechanism that determines whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
|
||||
|
||||
Write instructions for using the skill and its bundled resources.
|
||||
### Step 1: Generate trigger eval queries
|
||||
|
||||
### Step 5: Packaging a Skill
|
||||
Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:
|
||||
|
||||
Once development of the skill is complete, it must be packaged into a distributable .skill file that gets shared with the user. The packaging process automatically validates the skill first to ensure it meets all requirements:
|
||||
```json
|
||||
[
|
||||
{"query": "the user prompt", "should_trigger": true},
|
||||
{"query": "another prompt", "should_trigger": false}
|
||||
]
|
||||
```
|
||||
|
||||
The queries must be realistic and something a Claude Code or Claude.ai user would actually type. Not abstract requests, but requests that are concrete and specific and have a good amount of detail. For instance, file paths, personal context about the user's job or situation, column names and values, company names, URLs. A little bit of backstory. Some might be in lowercase or contain abbreviations or typos or casual speech. Use a mix of different lengths, and focus on edge cases rather than making them clear-cut (the user will get a chance to sign off on them).
|
||||
|
||||
Bad: `"Format this data"`, `"Extract text from PDF"`, `"Create a chart"`
|
||||
|
||||
Good: `"ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"`
|
||||
|
||||
For the **should-trigger** queries (8-10), think about coverage. You want different phrasings of the same intent — some formal, some casual. Include cases where the user doesn't explicitly name the skill or file type but clearly needs it. Throw in some uncommon use cases and cases where this skill competes with another but should win.
|
||||
|
||||
For the **should-not-trigger** queries (8-10), the most valuable ones are the near-misses — queries that share keywords or concepts with the skill but actually need something different. Think adjacent domains, ambiguous phrasing where a naive keyword match would trigger but shouldn't, and cases where the query touches on something the skill does but in a context where another tool is more appropriate.
|
||||
|
||||
The key thing to avoid: don't make should-not-trigger queries obviously irrelevant. "Write a fibonacci function" as a negative test for a PDF skill is too easy — it doesn't test anything. The negative cases should be genuinely tricky.
|
||||
|
||||
### Step 2: Review with user
|
||||
|
||||
Present the eval set to the user for review using the HTML template:
|
||||
|
||||
1. Read the template from `assets/eval_review.html`
|
||||
2. Replace the placeholders:
|
||||
- `__EVAL_DATA_PLACEHOLDER__` → the JSON array of eval items (no quotes around it — it's a JS variable assignment)
|
||||
- `__SKILL_NAME_PLACEHOLDER__` → the skill's name
|
||||
- `__SKILL_DESCRIPTION_PLACEHOLDER__` → the skill's current description
|
||||
3. Write to a temp file (e.g., `/tmp/eval_review_<skill-name>.html`) and open it: `open /tmp/eval_review_<skill-name>.html`
|
||||
4. The user can edit queries, toggle should-trigger, add/remove entries, then click "Export Eval Set"
|
||||
5. The file downloads to `~/Downloads/eval_set.json` — check the Downloads folder for the most recent version in case there are multiple (e.g., `eval_set (1).json`)
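
Steps 2 and 3 can be a few string replacements — roughly like this sketch (the skill name, description, and eval items are placeholders):

```python
import json
from pathlib import Path

skill_name = "my-skill"  # hypothetical
skill_description = "Current description from SKILL.md frontmatter"
eval_items = [{"query": "the user prompt", "should_trigger": True}]  # the drafted eval set

template = Path("assets/eval_review.html").read_text()
html = (
    template
    .replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(eval_items))  # JS variable assignment, so no surrounding quotes
    .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
    .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", skill_description)
)
out = Path(f"/tmp/eval_review_{skill_name}.html")
out.write_text(html)
print(f"open {out}")  # macOS; use xdg-open on Linux
```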
|
||||
|
||||
This step matters — bad eval queries lead to bad descriptions.
|
||||
|
||||
### Step 3: Run the optimization loop
|
||||
|
||||
Tell the user: "This will take some time — I'll run the optimization loop in the background and check on it periodically."
|
||||
|
||||
Save the eval set to the workspace, then run in the background:
|
||||
|
||||
```bash
|
||||
scripts/package_skill.py <path/to/skill-folder>
|
||||
python -m scripts.run_loop \
|
||||
--eval-set <path-to-trigger-eval.json> \
|
||||
--skill-path <path-to-skill> \
|
||||
--model <model-id-powering-this-session> \
|
||||
--max-iterations 5 \
|
||||
--verbose
|
||||
```
|
||||
|
||||
Optional output directory specification:
|
||||
Use the model ID from your system prompt (the one powering the current session) so the triggering test matches what the user actually experiences.
|
||||
|
||||
While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.
|
||||
|
||||
This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
|
||||
|
||||
### How skill triggering works
|
||||
|
||||
Understanding the triggering mechanism helps design better eval queries. Skills appear in Claude's `available_skills` list with their name + description, and Claude decides whether to consult a skill based on that description. The important thing to know is that Claude only consults skills for tasks it can't easily handle on its own — simple, one-step queries like "read this PDF" may not trigger a skill even if the description matches perfectly, because Claude can handle them directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description matches.
|
||||
|
||||
This means your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Simple queries like "read file X" are poor test cases — they won't trigger skills regardless of description quality.
|
||||
|
||||
### Step 4: Apply the result
|
||||
|
||||
Take `best_description` from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.
|
||||
|
||||
---
|
||||
|
||||
### Package and Present (only if `present_files` tool is available)
|
||||
|
||||
Check whether you have access to the `present_files` tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
|
||||
|
||||
```bash
|
||||
scripts/package_skill.py <path/to/skill-folder> ./dist
|
||||
python -m scripts.package_skill <path/to/skill-folder>
|
||||
```
|
||||
|
||||
The packaging script will:
|
||||
After packaging, direct the user to the resulting `.skill` file path so they can install it.
|
||||
|
||||
1. **Validate** the skill automatically, checking:
|
||||
---
|
||||
|
||||
- YAML frontmatter format and required fields
|
||||
- Skill naming conventions and directory structure
|
||||
- Description completeness and quality
|
||||
- File organization and resource references
|
||||
## Claude.ai-specific instructions
|
||||
|
||||
2. **Package** the skill if validation passes, creating a .skill file named after the skill (e.g., `my-skill.skill`) that includes all files and maintains the proper directory structure for distribution. The .skill file is a zip file with a .skill extension.
|
||||
In Claude.ai, the core workflow is the same (draft → test → review → improve → repeat), but because Claude.ai doesn't have subagents, some mechanics change. Here's what to adapt:
|
||||
|
||||
If validation fails, the script will report the errors and exit without creating a package. Fix any validation errors and run the packaging command again.
|
||||
**Running test cases**: No subagents means no parallel execution. For each test case, read the skill's SKILL.md, then follow its instructions to accomplish the test prompt yourself. Do them one at a time. This is less rigorous than independent subagents (you wrote the skill and you're also running it, so you have full context), but it's a useful sanity check — and the human review step compensates. Skip the baseline runs — just use the skill to complete the task as requested.
|
||||
|
||||
### Step 6: Iterate
|
||||
**Reviewing results**: If you can't open a browser (e.g., Claude.ai's VM has no display, or you're on a remote server), skip the browser reviewer entirely. Instead, present results directly in the conversation. For each test case, show the prompt and the output. If the output is a file the user needs to see (like a .docx or .xlsx), save it to the filesystem and tell them where it is so they can download and inspect it. Ask for feedback inline: "How does this look? Anything you'd change?"
|
||||
|
||||
After testing the skill, users may request improvements. Often this happens right after using the skill, with fresh context of how the skill performed.
|
||||
**Benchmarking**: Skip the quantitative benchmarking — it relies on baseline comparisons which aren't meaningful without subagents. Focus on qualitative feedback from the user.
|
||||
|
||||
**Iteration workflow:**
|
||||
**The iteration loop**: Same as before — improve the skill, rerun the test cases, ask for feedback — just without the browser reviewer in the middle. You can still organize results into iteration directories on the filesystem if you have one.
|
||||
|
||||
1. Use the skill on real tasks
|
||||
2. Notice struggles or inefficiencies
|
||||
3. Identify how SKILL.md or bundled resources should be updated
|
||||
4. Implement changes and test again
|
||||
**Description optimization**: This section requires the `claude` CLI tool (specifically `claude -p`) which is only available in Claude Code. Skip it if you're on Claude.ai.
|
||||
|
||||
**Blind comparison**: Requires subagents. Skip it.
|
||||
|
||||
**Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.
|
||||
|
||||
**Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:
|
||||
- **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
|
||||
- **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
|
||||
- **If packaging manually, stage in `/tmp/` first**, then copy to the output directory -- direct writes may fail due to permissions.
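
In shell terms, the copy-edit-package flow is roughly this (paths are illustrative; run the packaging module from the skill-creator directory as with the other scripts):

```bash
# Installed skill may be read-only; work on a copy that keeps the original name
cp -r /path/to/installed/research-helper /tmp/research-helper
# ...edit /tmp/research-helper/SKILL.md and its resources...
python -m scripts.package_skill /tmp/research-helper
```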
|
||||
|
||||
---
|
||||
|
||||
## Cowork-Specific Instructions
|
||||
|
||||
If you're in Cowork, the main things to know are:
|
||||
|
||||
- You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (However, if you run into severe problems with timeouts, it's OK to run the test prompts in series rather than parallel.)
|
||||
- You don't have a browser or display, so when generating the eval viewer, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Then proffer a link that the user can click to open the HTML in their browser.
|
||||
- For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so to reiterate: whether you're in Cowork or in Claude Code, after running tests you should always generate the eval viewer with `generate_review.py` (not your own boutique HTML) so the human can look at examples before you revise the skill and attempt corrections yourself. Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER *BEFORE* evaluating outputs yourself. You want to get them in front of the human ASAP!
|
||||
- Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first); see the sketch after this list.
|
||||
- Packaging works — `package_skill.py` just needs Python and a filesystem.
|
||||
- Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
|
||||
- **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the claude.ai section above.
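Once you have access to the downloaded file, reading it back is straightforward. A small sketch, assuming the `{"reviews": [...]}` layout that `generate_review.py` writes:

```python
# Read the feedback.json the viewer downloaded; each review carries a run_id and
# the human's freeform feedback for that run.
import json
from pathlib import Path

data = json.loads(Path("feedback.json").read_text())
for review in data.get("reviews", []):
    if review.get("feedback", "").strip():
        print(review["run_id"], "->", review["feedback"])
```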
|
||||
|
||||
---
|
||||
|
||||
## Reference files
|
||||
|
||||
The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.
|
||||
|
||||
- `agents/grader.md` — How to evaluate assertions against outputs
|
||||
- `agents/comparator.md` — How to do blind A/B comparison between two outputs
|
||||
- `agents/analyzer.md` — How to analyze why one version beat another
|
||||
|
||||
The references/ directory has additional documentation:
|
||||
- `references/schemas.md` — JSON structures for evals.json, grading.json, etc.
|
||||
|
||||
---
|
||||
|
||||
Repeating one more time the core loop here for emphasis:
|
||||
|
||||
- Figure out what the skill is about
|
||||
- Draft or edit the skill
|
||||
- Run claude-with-access-to-the-skill on test prompts
|
||||
- With the user, evaluate the outputs:
|
||||
- Create benchmark.json and run `eval-viewer/generate_review.py` to help the user review them
|
||||
- Run quantitative evals
|
||||
- Repeat until you and the user are satisfied
|
||||
- Package the final skill and return it to the user.
|
||||
|
||||
Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" in your TodoList to make sure it happens.
|
||||
|
||||
Good luck!
|
||||
|
||||
274
skills/public/skill-creator/agents/analyzer.md
Normal file
274
skills/public/skill-creator/agents/analyzer.md
Normal file
@@ -0,0 +1,274 @@
|
||||
# Post-hoc Analyzer Agent
|
||||
|
||||
Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.
|
||||
|
||||
## Role
|
||||
|
||||
After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?
|
||||
|
||||
## Inputs
|
||||
|
||||
You receive these parameters in your prompt:
|
||||
|
||||
- **winner**: "A" or "B" (from blind comparison)
|
||||
- **winner_skill_path**: Path to the skill that produced the winning output
|
||||
- **winner_transcript_path**: Path to the execution transcript for the winner
|
||||
- **loser_skill_path**: Path to the skill that produced the losing output
|
||||
- **loser_transcript_path**: Path to the execution transcript for the loser
|
||||
- **comparison_result_path**: Path to the blind comparator's output JSON
|
||||
- **output_path**: Where to save the analysis results
|
||||
|
||||
## Process
|
||||
|
||||
### Step 1: Read Comparison Result
|
||||
|
||||
1. Read the blind comparator's output at comparison_result_path
|
||||
2. Note the winning side (A or B), the reasoning, and any scores
|
||||
3. Understand what the comparator valued in the winning output
|
||||
|
||||
### Step 2: Read Both Skills
|
||||
|
||||
1. Read the winner skill's SKILL.md and key referenced files
|
||||
2. Read the loser skill's SKILL.md and key referenced files
|
||||
3. Identify structural differences:
|
||||
- Instructions clarity and specificity
|
||||
- Script/tool usage patterns
|
||||
- Example coverage
|
||||
- Edge case handling
|
||||
|
||||
### Step 3: Read Both Transcripts
|
||||
|
||||
1. Read the winner's transcript
|
||||
2. Read the loser's transcript
|
||||
3. Compare execution patterns:
|
||||
- How closely did each follow their skill's instructions?
|
||||
- What tools were used differently?
|
||||
- Where did the loser diverge from optimal behavior?
|
||||
- Did either encounter errors or make recovery attempts?
|
||||
|
||||
### Step 4: Analyze Instruction Following
|
||||
|
||||
For each transcript, evaluate:
|
||||
- Did the agent follow the skill's explicit instructions?
|
||||
- Did the agent use the skill's provided tools/scripts?
|
||||
- Were there missed opportunities to leverage skill content?
|
||||
- Did the agent add unnecessary steps not in the skill?
|
||||
|
||||
Score instruction following 1-10 and note specific issues.
|
||||
|
||||
### Step 5: Identify Winner Strengths
|
||||
|
||||
Determine what made the winner better:
|
||||
- Clearer instructions that led to better behavior?
|
||||
- Better scripts/tools that produced better output?
|
||||
- More comprehensive examples that guided edge cases?
|
||||
- Better error handling guidance?
|
||||
|
||||
Be specific. Quote from skills/transcripts where relevant.
|
||||
|
||||
### Step 6: Identify Loser Weaknesses
|
||||
|
||||
Determine what held the loser back:
|
||||
- Ambiguous instructions that led to suboptimal choices?
|
||||
- Missing tools/scripts that forced workarounds?
|
||||
- Gaps in edge case coverage?
|
||||
- Poor error handling that caused failures?
|
||||
|
||||
### Step 7: Generate Improvement Suggestions
|
||||
|
||||
Based on the analysis, produce actionable suggestions for improving the loser skill:
|
||||
- Specific instruction changes to make
|
||||
- Tools/scripts to add or modify
|
||||
- Examples to include
|
||||
- Edge cases to address
|
||||
|
||||
Prioritize by impact. Focus on changes that would have changed the outcome.
|
||||
|
||||
### Step 8: Write Analysis Results
|
||||
|
||||
Save structured analysis to `{output_path}`.
|
||||
|
||||
## Output Format
|
||||
|
||||
Write a JSON file with this structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"comparison_summary": {
|
||||
"winner": "A",
|
||||
"winner_skill": "path/to/winner/skill",
|
||||
"loser_skill": "path/to/loser/skill",
|
||||
"comparator_reasoning": "Brief summary of why comparator chose winner"
|
||||
},
|
||||
"winner_strengths": [
|
||||
"Clear step-by-step instructions for handling multi-page documents",
|
||||
"Included validation script that caught formatting errors",
|
||||
"Explicit guidance on fallback behavior when OCR fails"
|
||||
],
|
||||
"loser_weaknesses": [
|
||||
"Vague instruction 'process the document appropriately' led to inconsistent behavior",
|
||||
"No script for validation, agent had to improvise and made errors",
|
||||
"No guidance on OCR failure, agent gave up instead of trying alternatives"
|
||||
],
|
||||
"instruction_following": {
|
||||
"winner": {
|
||||
"score": 9,
|
||||
"issues": [
|
||||
"Minor: skipped optional logging step"
|
||||
]
|
||||
},
|
||||
"loser": {
|
||||
"score": 6,
|
||||
"issues": [
|
||||
"Did not use the skill's formatting template",
|
||||
"Invented own approach instead of following step 3",
|
||||
"Missed the 'always validate output' instruction"
|
||||
]
|
||||
}
|
||||
},
|
||||
"improvement_suggestions": [
|
||||
{
|
||||
"priority": "high",
|
||||
"category": "instructions",
|
||||
"suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
|
||||
"expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
|
||||
},
|
||||
{
|
||||
"priority": "high",
|
||||
"category": "tools",
|
||||
"suggestion": "Add validate_output.py script similar to winner skill's validation approach",
|
||||
"expected_impact": "Would catch formatting errors before final output"
|
||||
},
|
||||
{
|
||||
"priority": "medium",
|
||||
"category": "error_handling",
|
||||
"suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
|
||||
"expected_impact": "Would prevent early failure on difficult documents"
|
||||
}
|
||||
],
|
||||
"transcript_insights": {
|
||||
"winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
|
||||
"loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
- **Be specific**: Quote from skills and transcripts, don't just say "instructions were unclear"
|
||||
- **Be actionable**: Suggestions should be concrete changes, not vague advice
|
||||
- **Focus on skill improvements**: The goal is to improve the losing skill, not critique the agent
|
||||
- **Prioritize by impact**: Which changes would most likely have changed the outcome?
|
||||
- **Consider causation**: Did the skill weakness actually cause the worse output, or is it incidental?
|
||||
- **Stay objective**: Analyze what happened, don't editorialize
|
||||
- **Think about generalization**: Would this improvement help on other evals too?
|
||||
|
||||
## Categories for Suggestions
|
||||
|
||||
Use these categories to organize improvement suggestions:
|
||||
|
||||
| Category | Description |
|
||||
|----------|-------------|
|
||||
| `instructions` | Changes to the skill's prose instructions |
|
||||
| `tools` | Scripts, templates, or utilities to add/modify |
|
||||
| `examples` | Example inputs/outputs to include |
|
||||
| `error_handling` | Guidance for handling failures |
|
||||
| `structure` | Reorganization of skill content |
|
||||
| `references` | External docs or resources to add |
|
||||
|
||||
## Priority Levels
|
||||
|
||||
- **high**: Would likely change the outcome of this comparison
|
||||
- **medium**: Would improve quality but may not change win/loss
|
||||
- **low**: Nice to have, marginal improvement
|
||||
|
||||
---
|
||||
|
||||
# Analyzing Benchmark Results
|
||||
|
||||
When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.
|
||||
|
||||
## Role
|
||||
|
||||
Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
|
||||
|
||||
## Inputs
|
||||
|
||||
You receive these parameters in your prompt:
|
||||
|
||||
- **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
|
||||
- **skill_path**: Path to the skill being benchmarked
|
||||
- **output_path**: Where to save the notes (as JSON array of strings)
|
||||
|
||||
## Process
|
||||
|
||||
### Step 1: Read Benchmark Data
|
||||
|
||||
1. Read the benchmark.json containing all run results
|
||||
2. Note the configurations tested (with_skill, without_skill)
|
||||
3. Understand the run_summary aggregates already calculated
|
||||
|
||||
### Step 2: Analyze Per-Assertion Patterns
|
||||
|
||||
For each expectation across all runs (a tally sketch follows this list):
|
||||
- Does it **always pass** in both configurations? (may not differentiate skill value)
|
||||
- Does it **always fail** in both configurations? (may be broken or beyond capability)
|
||||
- Does it **always pass with skill but fail without**? (skill clearly adds value here)
|
||||
- Does it **always fail with skill but pass without**? (skill may be hurting)
|
||||
- Is it **highly variable**? (flaky expectation or non-deterministic behavior)
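A rough way to surface these per-assertion patterns programmatically, as referenced above. This is a sketch only: the key names below (`runs`, `config`, `grading`) are assumptions, so adapt them to the actual benchmark.json layout documented in `references/schemas.md`.

```python
# Hedged sketch: tally per-expectation pass rates split by configuration.
import json
from collections import defaultdict
from pathlib import Path

data = json.loads(Path("benchmark.json").read_text())
tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # text -> config -> [passed, total]

for run in data.get("runs", []):                  # assumed key
    config = run.get("config", "unknown")         # e.g. "with_skill" / "without_skill"
    for exp in run.get("grading", {}).get("expectations", []):
        passed, total = tally[exp["text"]][config]
        tally[exp["text"]][config] = [passed + (1 if exp.get("passed") else 0), total + 1]

for text, by_config in tally.items():
    print(text, {cfg: f"{p}/{t}" for cfg, (p, t) in by_config.items()})
```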
|
||||
|
||||
### Step 3: Analyze Cross-Eval Patterns
|
||||
|
||||
Look for patterns across evals:
|
||||
- Are certain eval types consistently harder/easier?
|
||||
- Do some evals show high variance while others are stable?
|
||||
- Are there surprising results that contradict expectations?
|
||||
|
||||
### Step 4: Analyze Metrics Patterns
|
||||
|
||||
Look at time_seconds, tokens, tool_calls:
|
||||
- Does the skill significantly increase execution time?
|
||||
- Is there high variance in resource usage?
|
||||
- Are there outlier runs that skew the aggregates?
|
||||
|
||||
### Step 5: Generate Notes
|
||||
|
||||
Write freeform observations as a list of strings. Each note should:
|
||||
- State a specific observation
|
||||
- Be grounded in the data (not speculation)
|
||||
- Help the user understand something the aggregate metrics don't show
|
||||
|
||||
Examples:
|
||||
- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
|
||||
- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
|
||||
- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
|
||||
- "Skill adds 13s average execution time but improves pass rate by 50%"
|
||||
- "Token usage is 80% higher with skill, primarily due to script output parsing"
|
||||
- "All 3 without-skill runs for eval 1 produced empty output"
|
||||
|
||||
### Step 6: Write Notes
|
||||
|
||||
Save notes to `{output_path}` as a JSON array of strings:
|
||||
|
||||
```json
|
||||
[
|
||||
"Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
|
||||
"Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
|
||||
"Without-skill runs consistently fail on table extraction expectations",
|
||||
"Skill adds 13s average execution time but improves pass rate by 50%"
|
||||
]
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
**DO:**
|
||||
- Report what you observe in the data
|
||||
- Be specific about which evals, expectations, or runs you're referring to
|
||||
- Note patterns that aggregate metrics would hide
|
||||
- Provide context that helps interpret the numbers
|
||||
|
||||
**DO NOT:**
|
||||
- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
|
||||
- Make subjective quality judgments ("the output was good/bad")
|
||||
- Speculate about causes without evidence
|
||||
- Repeat information already in the run_summary aggregates
|
||||
202
skills/public/skill-creator/agents/comparator.md
Normal file
202
skills/public/skill-creator/agents/comparator.md
Normal file
@@ -0,0 +1,202 @@
|
||||
# Blind Comparator Agent
|
||||
|
||||
Compare two outputs WITHOUT knowing which skill produced them.
|
||||
|
||||
## Role
|
||||
|
||||
The Blind Comparator judges which output better accomplishes the eval task. You receive two outputs labeled A and B, but you do NOT know which skill produced which. This prevents bias toward a particular skill or approach.
|
||||
|
||||
Your judgment is based purely on output quality and task completion.
|
||||
|
||||
## Inputs
|
||||
|
||||
You receive these parameters in your prompt:
|
||||
|
||||
- **output_a_path**: Path to the first output file or directory
|
||||
- **output_b_path**: Path to the second output file or directory
|
||||
- **eval_prompt**: The original task/prompt that was executed
|
||||
- **expectations**: List of expectations to check (optional - may be empty)
|
||||
|
||||
## Process
|
||||
|
||||
### Step 1: Read Both Outputs
|
||||
|
||||
1. Examine output A (file or directory)
|
||||
2. Examine output B (file or directory)
|
||||
3. Note the type, structure, and content of each
|
||||
4. If outputs are directories, examine all relevant files inside
|
||||
|
||||
### Step 2: Understand the Task
|
||||
|
||||
1. Read the eval_prompt carefully
|
||||
2. Identify what the task requires:
|
||||
- What should be produced?
|
||||
- What qualities matter (accuracy, completeness, format)?
|
||||
- What would distinguish a good output from a poor one?
|
||||
|
||||
### Step 3: Generate Evaluation Rubric
|
||||
|
||||
Based on the task, generate a rubric with two dimensions:
|
||||
|
||||
**Content Rubric** (what the output contains):
|
||||
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|
||||
|-----------|----------|----------------|---------------|
|
||||
| Correctness | Major errors | Minor errors | Fully correct |
|
||||
| Completeness | Missing key elements | Mostly complete | All elements present |
|
||||
| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
|
||||
|
||||
**Structure Rubric** (how the output is organized):
|
||||
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|
||||
|-----------|----------|----------------|---------------|
|
||||
| Organization | Disorganized | Reasonably organized | Clear, logical structure |
|
||||
| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
|
||||
| Usability | Difficult to use | Usable with effort | Easy to use |
|
||||
|
||||
Adapt criteria to the specific task. For example:
|
||||
- PDF form → "Field alignment", "Text readability", "Data placement"
|
||||
- Document → "Section structure", "Heading hierarchy", "Paragraph flow"
|
||||
- Data output → "Schema correctness", "Data types", "Completeness"
|
||||
|
||||
### Step 4: Evaluate Each Output Against the Rubric
|
||||
|
||||
For each output (A and B):
|
||||
|
||||
1. **Score each criterion** on the rubric (1-5 scale)
|
||||
2. **Calculate dimension totals**: Content score, Structure score
|
||||
3. **Calculate overall score**: Average of dimension scores, scaled to 1-10
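A minimal sketch of this arithmetic, assuming 1-5 criterion scores and reading "scaled to 1-10" as doubling the averaged dimension score; this reproduces the example numbers in the output format below.

```python
# Rubric arithmetic sketch: average each dimension's 1-5 scores, then double the
# mean of the two dimension scores to land on the 1-10 overall scale.
def dimension_score(scores: list[int]) -> float:
    return round(sum(scores) / len(scores), 1)

def overall_score(content: list[int], structure: list[int]) -> float:
    return round((dimension_score(content) + dimension_score(structure)) / 2 * 2, 1)

assert dimension_score([5, 5, 4]) == 4.7            # output A content score
assert dimension_score([4, 5, 4]) == 4.3            # output A structure score
assert overall_score([5, 5, 4], [4, 5, 4]) == 9.0   # output A overall score
```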
|
||||
|
||||
### Step 5: Check Assertions (if provided)
|
||||
|
||||
If expectations are provided:
|
||||
|
||||
1. Check each expectation against output A
|
||||
2. Check each expectation against output B
|
||||
3. Count pass rates for each output
|
||||
4. Use expectation scores as secondary evidence (not the primary decision factor)
|
||||
|
||||
### Step 6: Determine the Winner
|
||||
|
||||
Compare A and B based on (in priority order):
|
||||
|
||||
1. **Primary**: Overall rubric score (content + structure)
|
||||
2. **Secondary**: Assertion pass rates (if applicable)
|
||||
3. **Tiebreaker**: If truly equal, declare a TIE
|
||||
|
||||
Be decisive - ties should be rare. One output is usually better, even if marginally.
|
||||
|
||||
### Step 7: Write Comparison Results
|
||||
|
||||
Save results to a JSON file at the path specified (or `comparison.json` if not specified).
|
||||
|
||||
## Output Format
|
||||
|
||||
Write a JSON file with this structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"winner": "A",
|
||||
"reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
|
||||
"rubric": {
|
||||
"A": {
|
||||
"content": {
|
||||
"correctness": 5,
|
||||
"completeness": 5,
|
||||
"accuracy": 4
|
||||
},
|
||||
"structure": {
|
||||
"organization": 4,
|
||||
"formatting": 5,
|
||||
"usability": 4
|
||||
},
|
||||
"content_score": 4.7,
|
||||
"structure_score": 4.3,
|
||||
"overall_score": 9.0
|
||||
},
|
||||
"B": {
|
||||
"content": {
|
||||
"correctness": 3,
|
||||
"completeness": 2,
|
||||
"accuracy": 3
|
||||
},
|
||||
"structure": {
|
||||
"organization": 3,
|
||||
"formatting": 2,
|
||||
"usability": 3
|
||||
},
|
||||
"content_score": 2.7,
|
||||
"structure_score": 2.7,
|
||||
"overall_score": 5.4
|
||||
}
|
||||
},
|
||||
"output_quality": {
|
||||
"A": {
|
||||
"score": 9,
|
||||
"strengths": ["Complete solution", "Well-formatted", "All fields present"],
|
||||
"weaknesses": ["Minor style inconsistency in header"]
|
||||
},
|
||||
"B": {
|
||||
"score": 5,
|
||||
"strengths": ["Readable output", "Correct basic structure"],
|
||||
"weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
|
||||
}
|
||||
},
|
||||
"expectation_results": {
|
||||
"A": {
|
||||
"passed": 4,
|
||||
"total": 5,
|
||||
"pass_rate": 0.80,
|
||||
"details": [
|
||||
{"text": "Output includes name", "passed": true},
|
||||
{"text": "Output includes date", "passed": true},
|
||||
{"text": "Format is PDF", "passed": true},
|
||||
{"text": "Contains signature", "passed": false},
|
||||
{"text": "Readable text", "passed": true}
|
||||
]
|
||||
},
|
||||
"B": {
|
||||
"passed": 3,
|
||||
"total": 5,
|
||||
"pass_rate": 0.60,
|
||||
"details": [
|
||||
{"text": "Output includes name", "passed": true},
|
||||
{"text": "Output includes date", "passed": false},
|
||||
{"text": "Format is PDF", "passed": true},
|
||||
{"text": "Contains signature", "passed": false},
|
||||
{"text": "Readable text", "passed": true}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
If no expectations were provided, omit the `expectation_results` field entirely.
|
||||
|
||||
## Field Descriptions
|
||||
|
||||
- **winner**: "A", "B", or "TIE"
|
||||
- **reasoning**: Clear explanation of why the winner was chosen (or why it's a tie)
|
||||
- **rubric**: Structured rubric evaluation for each output
|
||||
- **content**: Scores for content criteria (correctness, completeness, accuracy)
|
||||
- **structure**: Scores for structure criteria (organization, formatting, usability)
|
||||
- **content_score**: Average of content criteria (1-5)
|
||||
- **structure_score**: Average of structure criteria (1-5)
|
||||
- **overall_score**: Combined score scaled to 1-10
|
||||
- **output_quality**: Summary quality assessment
|
||||
- **score**: 1-10 rating (should match rubric overall_score)
|
||||
- **strengths**: List of positive aspects
|
||||
- **weaknesses**: List of issues or shortcomings
|
||||
- **expectation_results**: (Only if expectations provided)
|
||||
- **passed**: Number of expectations that passed
|
||||
- **total**: Total number of expectations
|
||||
- **pass_rate**: Fraction passed (0.0 to 1.0)
|
||||
- **details**: Individual expectation results
|
||||
|
||||
## Guidelines
|
||||
|
||||
- **Stay blind**: DO NOT try to infer which skill produced which output. Judge purely on output quality.
|
||||
- **Be specific**: Cite specific examples when explaining strengths and weaknesses.
|
||||
- **Be decisive**: Choose a winner unless outputs are genuinely equivalent.
|
||||
- **Output quality first**: Assertion scores are secondary to overall task completion.
|
||||
- **Be objective**: Don't favor outputs based on style preferences; focus on correctness and completeness.
|
||||
- **Explain your reasoning**: The reasoning field should make it clear why you chose the winner.
|
||||
- **Handle edge cases**: If both outputs fail, pick the one that fails less badly. If both are excellent, pick the one that's marginally better.
|
||||
223
skills/public/skill-creator/agents/grader.md
Normal file
223
skills/public/skill-creator/agents/grader.md
Normal file
@@ -0,0 +1,223 @@
|
||||
# Grader Agent
|
||||
|
||||
Evaluate expectations against an execution transcript and outputs.
|
||||
|
||||
## Role
|
||||
|
||||
The Grader reviews a transcript and output files, then determines whether each expectation passes or fails. Provide clear evidence for each judgment.
|
||||
|
||||
You have two jobs: grade the outputs, and critique the evals themselves. A passing grade on a weak assertion is worse than useless — it creates false confidence. When you notice an assertion that's trivially satisfied, or an important outcome that no assertion checks, say so.
|
||||
|
||||
## Inputs
|
||||
|
||||
You receive these parameters in your prompt:
|
||||
|
||||
- **expectations**: List of expectations to evaluate (strings)
|
||||
- **transcript_path**: Path to the execution transcript (markdown file)
|
||||
- **outputs_dir**: Directory containing output files from execution
|
||||
|
||||
## Process
|
||||
|
||||
### Step 1: Read the Transcript
|
||||
|
||||
1. Read the transcript file completely
|
||||
2. Note the eval prompt, execution steps, and final result
|
||||
3. Identify any issues or errors documented
|
||||
|
||||
### Step 2: Examine Output Files
|
||||
|
||||
1. List files in outputs_dir
|
||||
2. Read/examine each file relevant to the expectations. If outputs aren't plain text, use the inspection tools provided in your prompt — don't rely solely on what the transcript says the executor produced.
|
||||
3. Note contents, structure, and quality
|
||||
|
||||
### Step 3: Evaluate Each Assertion
|
||||
|
||||
For each expectation:
|
||||
|
||||
1. **Search for evidence** in the transcript and outputs
|
||||
2. **Determine verdict**:
|
||||
- **PASS**: Clear evidence the expectation is true AND the evidence reflects genuine task completion, not just surface-level compliance
|
||||
- **FAIL**: No evidence, or evidence contradicts the expectation, or the evidence is superficial (e.g., correct filename but empty/wrong content)
|
||||
3. **Cite the evidence**: Quote the specific text or describe what you found
|
||||
|
||||
### Step 4: Extract and Verify Claims
|
||||
|
||||
Beyond the predefined expectations, extract implicit claims from the outputs and verify them:
|
||||
|
||||
1. **Extract claims** from the transcript and outputs:
|
||||
- Factual statements ("The form has 12 fields")
|
||||
- Process claims ("Used pypdf to fill the form")
|
||||
- Quality claims ("All fields were filled correctly")
|
||||
|
||||
2. **Verify each claim**:
|
||||
- **Factual claims**: Can be checked against the outputs or external sources
|
||||
- **Process claims**: Can be verified from the transcript
|
||||
- **Quality claims**: Evaluate whether the claim is justified
|
||||
|
||||
3. **Flag unverifiable claims**: Note claims that cannot be verified with available information
|
||||
|
||||
This catches issues that predefined expectations might miss.
|
||||
|
||||
### Step 5: Read User Notes
|
||||
|
||||
If `{outputs_dir}/user_notes.md` exists:
|
||||
1. Read it and note any uncertainties or issues flagged by the executor
|
||||
2. Include relevant concerns in the grading output
|
||||
3. These may reveal problems even when expectations pass
|
||||
|
||||
### Step 6: Critique the Evals
|
||||
|
||||
After grading, consider whether the evals themselves could be improved. Only surface suggestions when there's a clear gap.
|
||||
|
||||
Good suggestions test meaningful outcomes — assertions that are hard to satisfy without actually doing the work correctly. Think about what makes an assertion *discriminating*: it passes when the skill genuinely succeeds and fails when it doesn't.
|
||||
|
||||
Suggestions worth raising:
|
||||
- An assertion that passed but would also pass for a clearly wrong output (e.g., checking filename existence but not file content)
|
||||
- An important outcome you observed — good or bad — that no assertion covers at all
|
||||
- An assertion that can't actually be verified from the available outputs
|
||||
|
||||
Keep the bar high. The goal is to flag things the eval author would say "good catch" about, not to nitpick every assertion.
|
||||
|
||||
### Step 7: Write Grading Results
|
||||
|
||||
Save results to `{outputs_dir}/../grading.json` (sibling to outputs_dir).
|
||||
|
||||
## Grading Criteria
|
||||
|
||||
**PASS when**:
|
||||
- The transcript or outputs clearly demonstrate the expectation is true
|
||||
- Specific evidence can be cited
|
||||
- The evidence reflects genuine substance, not just surface compliance (e.g., a file exists AND contains correct content, not just the right filename)
|
||||
|
||||
**FAIL when**:
|
||||
- No evidence found for the expectation
|
||||
- Evidence contradicts the expectation
|
||||
- The expectation cannot be verified from available information
|
||||
- The evidence is superficial — the assertion is technically satisfied but the underlying task outcome is wrong or incomplete
|
||||
- The output appears to meet the assertion by coincidence rather than by actually doing the work
|
||||
|
||||
**When uncertain**: Err on the side of FAIL; the burden of proof to pass is on the expectation.
|
||||
|
||||
### Step 8: Read Executor Metrics and Timing
|
||||
|
||||
1. If `{outputs_dir}/metrics.json` exists, read it and include in grading output
|
||||
2. If `{outputs_dir}/../timing.json` exists, read it and include timing data
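A minimal sketch of folding these into the grading output (the grading dict itself comes from the earlier steps):

```python
# Attach executor metrics and timing when the files exist, then write grading.json
# as a sibling of outputs_dir, matching Step 7.
import json
from pathlib import Path

outputs_dir = Path("outputs")
grading = {"expectations": [], "summary": {}}     # built in Steps 1-6

metrics_path = outputs_dir / "metrics.json"
if metrics_path.exists():
    grading["execution_metrics"] = json.loads(metrics_path.read_text())

timing_path = outputs_dir.parent / "timing.json"
if timing_path.exists():
    grading["timing"] = json.loads(timing_path.read_text())

(outputs_dir.parent / "grading.json").write_text(json.dumps(grading, indent=2) + "\n")
```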
|
||||
|
||||
## Output Format
|
||||
|
||||
Write a JSON file with this structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"expectations": [
|
||||
{
|
||||
"text": "The output includes the name 'John Smith'",
|
||||
"passed": true,
|
||||
"evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
|
||||
},
|
||||
{
|
||||
"text": "The spreadsheet has a SUM formula in cell B10",
|
||||
"passed": false,
|
||||
"evidence": "No spreadsheet was created. The output was a text file."
|
||||
},
|
||||
{
|
||||
"text": "The assistant used the skill's OCR script",
|
||||
"passed": true,
|
||||
"evidence": "Transcript Step 2 shows: 'Tool: Bash - python ocr_script.py image.png'"
|
||||
}
|
||||
],
|
||||
"summary": {
|
||||
"passed": 2,
|
||||
"failed": 1,
|
||||
"total": 3,
|
||||
"pass_rate": 0.67
|
||||
},
|
||||
"execution_metrics": {
|
||||
"tool_calls": {
|
||||
"Read": 5,
|
||||
"Write": 2,
|
||||
"Bash": 8
|
||||
},
|
||||
"total_tool_calls": 15,
|
||||
"total_steps": 6,
|
||||
"errors_encountered": 0,
|
||||
"output_chars": 12450,
|
||||
"transcript_chars": 3200
|
||||
},
|
||||
"timing": {
|
||||
"executor_duration_seconds": 165.0,
|
||||
"grader_duration_seconds": 26.0,
|
||||
"total_duration_seconds": 191.0
|
||||
},
|
||||
"claims": [
|
||||
{
|
||||
"claim": "The form has 12 fillable fields",
|
||||
"type": "factual",
|
||||
"verified": true,
|
||||
"evidence": "Counted 12 fields in field_info.json"
|
||||
},
|
||||
{
|
||||
"claim": "All required fields were populated",
|
||||
"type": "quality",
|
||||
"verified": false,
|
||||
"evidence": "Reference section was left blank despite data being available"
|
||||
}
|
||||
],
|
||||
"user_notes_summary": {
|
||||
"uncertainties": ["Used 2023 data, may be stale"],
|
||||
"needs_review": [],
|
||||
"workarounds": ["Fell back to text overlay for non-fillable fields"]
|
||||
},
|
||||
"eval_feedback": {
|
||||
"suggestions": [
|
||||
{
|
||||
"assertion": "The output includes the name 'John Smith'",
|
||||
"reason": "A hallucinated document that mentions the name would also pass — consider checking it appears as the primary contact with matching phone and email from the input"
|
||||
},
|
||||
{
|
||||
"reason": "No assertion checks whether the extracted phone numbers match the input — I observed incorrect numbers in the output that went uncaught"
|
||||
}
|
||||
],
|
||||
"overall": "Assertions check presence but not correctness. Consider adding content verification."
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Field Descriptions
|
||||
|
||||
- **expectations**: Array of graded expectations
|
||||
- **text**: The original expectation text
|
||||
- **passed**: Boolean - true if expectation passes
|
||||
- **evidence**: Specific quote or description supporting the verdict
|
||||
- **summary**: Aggregate statistics
|
||||
- **passed**: Count of passed expectations
|
||||
- **failed**: Count of failed expectations
|
||||
- **total**: Total expectations evaluated
|
||||
- **pass_rate**: Fraction passed (0.0 to 1.0)
|
||||
- **execution_metrics**: Copied from executor's metrics.json (if available)
|
||||
- **output_chars**: Total character count of output files (proxy for tokens)
|
||||
- **transcript_chars**: Character count of transcript
|
||||
- **timing**: Wall clock timing from timing.json (if available)
|
||||
- **executor_duration_seconds**: Time spent in executor subagent
|
||||
- **total_duration_seconds**: Total elapsed time for the run
|
||||
- **claims**: Extracted and verified claims from the output
|
||||
- **claim**: The statement being verified
|
||||
- **type**: "factual", "process", or "quality"
|
||||
- **verified**: Boolean - whether the claim holds
|
||||
- **evidence**: Supporting or contradicting evidence
|
||||
- **user_notes_summary**: Issues flagged by the executor
|
||||
- **uncertainties**: Things the executor wasn't sure about
|
||||
- **needs_review**: Items requiring human attention
|
||||
- **workarounds**: Places where the skill didn't work as expected
|
||||
- **eval_feedback**: Improvement suggestions for the evals (only when warranted)
|
||||
- **suggestions**: List of concrete suggestions, each with a `reason` and optionally an `assertion` it relates to
|
||||
- **overall**: Brief assessment — can be "No suggestions, evals look solid" if nothing to flag
|
||||
|
||||
## Guidelines
|
||||
|
||||
- **Be objective**: Base verdicts on evidence, not assumptions
|
||||
- **Be specific**: Quote the exact text that supports your verdict
|
||||
- **Be thorough**: Check both transcript and output files
|
||||
- **Be consistent**: Apply the same standard to each expectation
|
||||
- **Explain failures**: Make it clear why evidence was insufficient
|
||||
- **No partial credit**: Each expectation is pass or fail, not partial
|
||||
146
skills/public/skill-creator/assets/eval_review.html
Normal file
146
skills/public/skill-creator/assets/eval_review.html
Normal file
@@ -0,0 +1,146 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Eval Set Review - __SKILL_NAME_PLACEHOLDER__</title>
|
||||
<link rel="preconnect" href="https://fonts.googleapis.com">
|
||||
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
||||
<link href="https://fonts.googleapis.com/css2?family=Poppins:wght@500;600&family=Lora:wght@400;500&display=swap" rel="stylesheet">
|
||||
<style>
|
||||
* { box-sizing: border-box; margin: 0; padding: 0; }
|
||||
body { font-family: 'Lora', Georgia, serif; background: #faf9f5; padding: 2rem; color: #141413; }
|
||||
h1 { font-family: 'Poppins', sans-serif; margin-bottom: 0.5rem; font-size: 1.5rem; }
|
||||
.description { color: #b0aea5; margin-bottom: 1.5rem; font-style: italic; max-width: 900px; }
|
||||
.controls { margin-bottom: 1rem; display: flex; gap: 0.5rem; }
|
||||
.btn { font-family: 'Poppins', sans-serif; padding: 0.5rem 1rem; border: none; border-radius: 6px; cursor: pointer; font-size: 0.875rem; font-weight: 500; }
|
||||
.btn-add { background: #6a9bcc; color: white; }
|
||||
.btn-add:hover { background: #5889b8; }
|
||||
.btn-export { background: #d97757; color: white; }
|
||||
.btn-export:hover { background: #c4613f; }
|
||||
table { width: 100%; max-width: 1100px; border-collapse: collapse; background: white; border-radius: 6px; overflow: hidden; box-shadow: 0 1px 3px rgba(0,0,0,0.08); }
|
||||
th { font-family: 'Poppins', sans-serif; background: #141413; color: #faf9f5; padding: 0.75rem 1rem; text-align: left; font-size: 0.875rem; }
|
||||
td { padding: 0.75rem 1rem; border-bottom: 1px solid #e8e6dc; vertical-align: top; }
|
||||
tr:nth-child(even) td { background: #faf9f5; }
|
||||
tr:hover td { background: #f3f1ea; }
|
||||
.section-header td { background: #e8e6dc; font-family: 'Poppins', sans-serif; font-weight: 500; font-size: 0.8rem; color: #141413; text-transform: uppercase; letter-spacing: 0.05em; }
|
||||
.query-input { width: 100%; padding: 0.4rem; border: 1px solid #e8e6dc; border-radius: 4px; font-size: 0.875rem; font-family: 'Lora', Georgia, serif; resize: vertical; min-height: 60px; }
|
||||
.query-input:focus { outline: none; border-color: #d97757; box-shadow: 0 0 0 2px rgba(217,119,87,0.15); }
|
||||
.toggle { position: relative; display: inline-block; width: 44px; height: 24px; }
|
||||
.toggle input { opacity: 0; width: 0; height: 0; }
|
||||
.toggle .slider { position: absolute; inset: 0; background: #b0aea5; border-radius: 24px; cursor: pointer; transition: 0.2s; }
|
||||
.toggle .slider::before { content: ""; position: absolute; width: 18px; height: 18px; left: 3px; bottom: 3px; background: white; border-radius: 50%; transition: 0.2s; }
|
||||
.toggle input:checked + .slider { background: #d97757; }
|
||||
.toggle input:checked + .slider::before { transform: translateX(20px); }
|
||||
.btn-delete { background: #c44; color: white; padding: 0.3rem 0.6rem; border: none; border-radius: 4px; cursor: pointer; font-size: 0.75rem; font-family: 'Poppins', sans-serif; }
|
||||
.btn-delete:hover { background: #a33; }
|
||||
.summary { margin-top: 1rem; color: #b0aea5; font-size: 0.875rem; }
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<h1>Eval Set Review: <span id="skill-name">__SKILL_NAME_PLACEHOLDER__</span></h1>
|
||||
<p class="description">Current description: <span id="skill-desc">__SKILL_DESCRIPTION_PLACEHOLDER__</span></p>
|
||||
|
||||
<div class="controls">
|
||||
<button class="btn btn-add" onclick="addRow()">+ Add Query</button>
|
||||
<button class="btn btn-export" onclick="exportEvalSet()">Export Eval Set</button>
|
||||
</div>
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="width:65%">Query</th>
|
||||
<th style="width:18%">Should Trigger</th>
|
||||
<th style="width:10%">Actions</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody id="eval-body"></tbody>
|
||||
</table>
|
||||
|
||||
<p class="summary" id="summary"></p>
|
||||
|
||||
<script>
|
||||
const EVAL_DATA = __EVAL_DATA_PLACEHOLDER__;
|
||||
|
||||
let evalItems = [...EVAL_DATA];
|
||||
|
||||
function render() {
|
||||
const tbody = document.getElementById('eval-body');
|
||||
tbody.innerHTML = '';
|
||||
|
||||
// Sort: should-trigger first, then should-not-trigger
|
||||
const sorted = evalItems
|
||||
.map((item, origIdx) => ({ ...item, origIdx }))
|
||||
.sort((a, b) => (b.should_trigger ? 1 : 0) - (a.should_trigger ? 1 : 0));
|
||||
|
||||
let lastGroup = null;
|
||||
sorted.forEach(item => {
|
||||
const group = item.should_trigger ? 'trigger' : 'no-trigger';
|
||||
if (group !== lastGroup) {
|
||||
const headerRow = document.createElement('tr');
|
||||
headerRow.className = 'section-header';
|
||||
headerRow.innerHTML = `<td colspan="3">${item.should_trigger ? 'Should Trigger' : 'Should NOT Trigger'}</td>`;
|
||||
tbody.appendChild(headerRow);
|
||||
lastGroup = group;
|
||||
}
|
||||
|
||||
const idx = item.origIdx;
|
||||
const tr = document.createElement('tr');
|
||||
tr.innerHTML = `
|
||||
<td><textarea class="query-input" onchange="updateQuery(${idx}, this.value)">${escapeHtml(item.query)}</textarea></td>
|
||||
<td>
|
||||
<label class="toggle">
|
||||
<input type="checkbox" ${item.should_trigger ? 'checked' : ''} onchange="updateTrigger(${idx}, this.checked)">
|
||||
<span class="slider"></span>
|
||||
</label>
|
||||
<span style="margin-left:8px;font-size:0.8rem;color:#b0aea5">${item.should_trigger ? 'Yes' : 'No'}</span>
|
||||
</td>
|
||||
<td><button class="btn-delete" onclick="deleteRow(${idx})">Delete</button></td>
|
||||
`;
|
||||
tbody.appendChild(tr);
|
||||
});
|
||||
updateSummary();
|
||||
}
|
||||
|
||||
function escapeHtml(text) {
|
||||
const div = document.createElement('div');
|
||||
div.textContent = text;
|
||||
return div.innerHTML;
|
||||
}
|
||||
|
||||
function updateQuery(idx, value) { evalItems[idx].query = value; updateSummary(); }
|
||||
function updateTrigger(idx, value) { evalItems[idx].should_trigger = value; render(); }
|
||||
function deleteRow(idx) { evalItems.splice(idx, 1); render(); }
|
||||
|
||||
function addRow() {
|
||||
evalItems.push({ query: '', should_trigger: true });
|
||||
render();
|
||||
const inputs = document.querySelectorAll('.query-input');
|
||||
inputs[inputs.length - 1].focus();
|
||||
}
|
||||
|
||||
function updateSummary() {
|
||||
const trigger = evalItems.filter(i => i.should_trigger).length;
|
||||
const noTrigger = evalItems.filter(i => !i.should_trigger).length;
|
||||
document.getElementById('summary').textContent =
|
||||
`${evalItems.length} queries total: ${trigger} should trigger, ${noTrigger} should not trigger`;
|
||||
}
|
||||
|
||||
function exportEvalSet() {
|
||||
const valid = evalItems.filter(i => i.query.trim() !== '');
|
||||
const data = valid.map(i => ({ query: i.query.trim(), should_trigger: i.should_trigger }));
|
||||
const blob = new Blob([JSON.stringify(data, null, 2)], { type: 'application/json' });
|
||||
const url = URL.createObjectURL(blob);
|
||||
const a = document.createElement('a');
|
||||
a.href = url;
|
||||
a.download = 'eval_set.json';
|
||||
document.body.appendChild(a);
|
||||
a.click();
|
||||
document.body.removeChild(a);
|
||||
URL.revokeObjectURL(url);
|
||||
}
|
||||
|
||||
render();
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
471
skills/public/skill-creator/eval-viewer/generate_review.py
Normal file
471
skills/public/skill-creator/eval-viewer/generate_review.py
Normal file
@@ -0,0 +1,471 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate and serve a review page for eval results.
|
||||
|
||||
Reads the workspace directory, discovers runs (directories with outputs/),
|
||||
embeds all output data into a self-contained HTML page, and serves it via
|
||||
a tiny HTTP server. Feedback auto-saves to feedback.json in the workspace.
|
||||
|
||||
Usage:
|
||||
python generate_review.py <workspace-path> [--port PORT] [--skill-name NAME]
|
||||
python generate_review.py <workspace-path> --previous-feedback /path/to/old/feedback.json
|
||||
|
||||
No dependencies beyond the Python stdlib are required.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import base64
|
||||
import json
|
||||
import mimetypes
|
||||
import os
|
||||
import re
|
||||
import signal
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
import webbrowser
|
||||
from functools import partial
|
||||
from http.server import HTTPServer, BaseHTTPRequestHandler
|
||||
from pathlib import Path
|
||||
|
||||
# Files to exclude from output listings
|
||||
METADATA_FILES = {"transcript.md", "user_notes.md", "metrics.json"}
|
||||
|
||||
# Extensions we render as inline text
|
||||
TEXT_EXTENSIONS = {
|
||||
".txt", ".md", ".json", ".csv", ".py", ".js", ".ts", ".tsx", ".jsx",
|
||||
".yaml", ".yml", ".xml", ".html", ".css", ".sh", ".rb", ".go", ".rs",
|
||||
".java", ".c", ".cpp", ".h", ".hpp", ".sql", ".r", ".toml",
|
||||
}
|
||||
|
||||
# Extensions we render as inline images
|
||||
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}
|
||||
|
||||
# MIME type overrides for common types
|
||||
MIME_OVERRIDES = {
|
||||
".svg": "image/svg+xml",
|
||||
".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
}
|
||||
|
||||
|
||||
def get_mime_type(path: Path) -> str:
|
||||
ext = path.suffix.lower()
|
||||
if ext in MIME_OVERRIDES:
|
||||
return MIME_OVERRIDES[ext]
|
||||
mime, _ = mimetypes.guess_type(str(path))
|
||||
return mime or "application/octet-stream"
|
||||
|
||||
|
||||
def find_runs(workspace: Path) -> list[dict]:
|
||||
"""Recursively find directories that contain an outputs/ subdirectory."""
|
||||
runs: list[dict] = []
|
||||
_find_runs_recursive(workspace, workspace, runs)
|
||||
runs.sort(key=lambda r: (r["eval_id"] if r.get("eval_id") is not None else float("inf"), r["id"]))
|
||||
return runs
|
||||
|
||||
|
||||
def _find_runs_recursive(root: Path, current: Path, runs: list[dict]) -> None:
|
||||
if not current.is_dir():
|
||||
return
|
||||
|
||||
outputs_dir = current / "outputs"
|
||||
if outputs_dir.is_dir():
|
||||
run = build_run(root, current)
|
||||
if run:
|
||||
runs.append(run)
|
||||
return
|
||||
|
||||
skip = {"node_modules", ".git", "__pycache__", "skill", "inputs"}
|
||||
for child in sorted(current.iterdir()):
|
||||
if child.is_dir() and child.name not in skip:
|
||||
_find_runs_recursive(root, child, runs)
|
||||
|
||||
|
||||
def build_run(root: Path, run_dir: Path) -> dict | None:
|
||||
"""Build a run dict with prompt, outputs, and grading data."""
|
||||
prompt = ""
|
||||
eval_id = None
|
||||
|
||||
# Try eval_metadata.json
|
||||
for candidate in [run_dir / "eval_metadata.json", run_dir.parent / "eval_metadata.json"]:
|
||||
if candidate.exists():
|
||||
try:
|
||||
metadata = json.loads(candidate.read_text())
|
||||
prompt = metadata.get("prompt", "")
|
||||
eval_id = metadata.get("eval_id")
|
||||
except (json.JSONDecodeError, OSError):
|
||||
pass
|
||||
if prompt:
|
||||
break
|
||||
|
||||
# Fall back to transcript.md
|
||||
if not prompt:
|
||||
for candidate in [run_dir / "transcript.md", run_dir / "outputs" / "transcript.md"]:
|
||||
if candidate.exists():
|
||||
try:
|
||||
text = candidate.read_text()
|
||||
match = re.search(r"## Eval Prompt\n\n([\s\S]*?)(?=\n##|$)", text)
|
||||
if match:
|
||||
prompt = match.group(1).strip()
|
||||
except OSError:
|
||||
pass
|
||||
if prompt:
|
||||
break
|
||||
|
||||
if not prompt:
|
||||
prompt = "(No prompt found)"
|
||||
|
||||
run_id = str(run_dir.relative_to(root)).replace("/", "-").replace("\\", "-")
|
||||
|
||||
# Collect output files
|
||||
outputs_dir = run_dir / "outputs"
|
||||
output_files: list[dict] = []
|
||||
if outputs_dir.is_dir():
|
||||
for f in sorted(outputs_dir.iterdir()):
|
||||
if f.is_file() and f.name not in METADATA_FILES:
|
||||
output_files.append(embed_file(f))
|
||||
|
||||
# Load grading if present
|
||||
grading = None
|
||||
for candidate in [run_dir / "grading.json", run_dir.parent / "grading.json"]:
|
||||
if candidate.exists():
|
||||
try:
|
||||
grading = json.loads(candidate.read_text())
|
||||
except (json.JSONDecodeError, OSError):
|
||||
pass
|
||||
if grading:
|
||||
break
|
||||
|
||||
return {
|
||||
"id": run_id,
|
||||
"prompt": prompt,
|
||||
"eval_id": eval_id,
|
||||
"outputs": output_files,
|
||||
"grading": grading,
|
||||
}
|
||||
|
||||
|
||||
def embed_file(path: Path) -> dict:
|
||||
"""Read a file and return an embedded representation."""
|
||||
ext = path.suffix.lower()
|
||||
mime = get_mime_type(path)
|
||||
|
||||
if ext in TEXT_EXTENSIONS:
|
||||
try:
|
||||
content = path.read_text(errors="replace")
|
||||
except OSError:
|
||||
content = "(Error reading file)"
|
||||
return {
|
||||
"name": path.name,
|
||||
"type": "text",
|
||||
"content": content,
|
||||
}
|
||||
elif ext in IMAGE_EXTENSIONS:
|
||||
try:
|
||||
raw = path.read_bytes()
|
||||
b64 = base64.b64encode(raw).decode("ascii")
|
||||
except OSError:
|
||||
return {"name": path.name, "type": "error", "content": "(Error reading file)"}
|
||||
return {
|
||||
"name": path.name,
|
||||
"type": "image",
|
||||
"mime": mime,
|
||||
"data_uri": f"data:{mime};base64,{b64}",
|
||||
}
|
||||
elif ext == ".pdf":
|
||||
try:
|
||||
raw = path.read_bytes()
|
||||
b64 = base64.b64encode(raw).decode("ascii")
|
||||
except OSError:
|
||||
return {"name": path.name, "type": "error", "content": "(Error reading file)"}
|
||||
return {
|
||||
"name": path.name,
|
||||
"type": "pdf",
|
||||
"data_uri": f"data:{mime};base64,{b64}",
|
||||
}
|
||||
elif ext == ".xlsx":
|
||||
try:
|
||||
raw = path.read_bytes()
|
||||
b64 = base64.b64encode(raw).decode("ascii")
|
||||
except OSError:
|
||||
return {"name": path.name, "type": "error", "content": "(Error reading file)"}
|
||||
return {
|
||||
"name": path.name,
|
||||
"type": "xlsx",
|
||||
"data_b64": b64,
|
||||
}
|
||||
else:
|
||||
# Binary / unknown — base64 download link
|
||||
try:
|
||||
raw = path.read_bytes()
|
||||
b64 = base64.b64encode(raw).decode("ascii")
|
||||
except OSError:
|
||||
return {"name": path.name, "type": "error", "content": "(Error reading file)"}
|
||||
return {
|
||||
"name": path.name,
|
||||
"type": "binary",
|
||||
"mime": mime,
|
||||
"data_uri": f"data:{mime};base64,{b64}",
|
||||
}
|
||||
|
||||
|
||||
def load_previous_iteration(workspace: Path) -> dict[str, dict]:
|
||||
"""Load previous iteration's feedback and outputs.
|
||||
|
||||
Returns a map of run_id -> {"feedback": str, "outputs": list[dict]}.
|
||||
"""
|
||||
result: dict[str, dict] = {}
|
||||
|
||||
# Load feedback
|
||||
feedback_map: dict[str, str] = {}
|
||||
feedback_path = workspace / "feedback.json"
|
||||
if feedback_path.exists():
|
||||
try:
|
||||
data = json.loads(feedback_path.read_text())
|
||||
feedback_map = {
|
||||
r["run_id"]: r["feedback"]
|
||||
for r in data.get("reviews", [])
|
||||
if r.get("feedback", "").strip()
|
||||
}
|
||||
except (json.JSONDecodeError, OSError, KeyError):
|
||||
pass
|
||||
|
||||
# Load runs (to get outputs)
|
||||
prev_runs = find_runs(workspace)
|
||||
for run in prev_runs:
|
||||
result[run["id"]] = {
|
||||
"feedback": feedback_map.get(run["id"], ""),
|
||||
"outputs": run.get("outputs", []),
|
||||
}
|
||||
|
||||
# Also add feedback for run_ids that had feedback but no matching run
|
||||
for run_id, fb in feedback_map.items():
|
||||
if run_id not in result:
|
||||
result[run_id] = {"feedback": fb, "outputs": []}
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def generate_html(
|
||||
runs: list[dict],
|
||||
skill_name: str,
|
||||
previous: dict[str, dict] | None = None,
|
||||
benchmark: dict | None = None,
|
||||
) -> str:
|
||||
"""Generate the complete standalone HTML page with embedded data."""
|
||||
template_path = Path(__file__).parent / "viewer.html"
|
||||
template = template_path.read_text()
|
||||
|
||||
# Build previous_feedback and previous_outputs maps for the template
|
||||
previous_feedback: dict[str, str] = {}
|
||||
previous_outputs: dict[str, list[dict]] = {}
|
||||
if previous:
|
||||
for run_id, data in previous.items():
|
||||
if data.get("feedback"):
|
||||
previous_feedback[run_id] = data["feedback"]
|
||||
if data.get("outputs"):
|
||||
previous_outputs[run_id] = data["outputs"]
|
||||
|
||||
embedded = {
|
||||
"skill_name": skill_name,
|
||||
"runs": runs,
|
||||
"previous_feedback": previous_feedback,
|
||||
"previous_outputs": previous_outputs,
|
||||
}
|
||||
if benchmark:
|
||||
embedded["benchmark"] = benchmark
|
||||
|
||||
data_json = json.dumps(embedded)
|
||||
|
||||
return template.replace("/*__EMBEDDED_DATA__*/", f"const EMBEDDED_DATA = {data_json};")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# HTTP server (stdlib only, zero dependencies)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _kill_port(port: int) -> None:
|
||||
"""Kill any process listening on the given port."""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["lsof", "-ti", f":{port}"],
|
||||
capture_output=True, text=True, timeout=5,
|
||||
)
|
||||
for pid_str in result.stdout.strip().split("\n"):
|
||||
if pid_str.strip():
|
||||
try:
|
||||
os.kill(int(pid_str.strip()), signal.SIGTERM)
|
||||
except (ProcessLookupError, ValueError):
|
||||
pass
|
||||
if result.stdout.strip():
|
||||
time.sleep(0.5)
|
||||
except subprocess.TimeoutExpired:
|
||||
pass
|
||||
except FileNotFoundError:
|
||||
print("Note: lsof not found, cannot check if port is in use", file=sys.stderr)
|
||||
|
||||
class ReviewHandler(BaseHTTPRequestHandler):
|
||||
"""Serves the review HTML and handles feedback saves.
|
||||
|
||||
Regenerates the HTML on each page load so that refreshing the browser
|
||||
picks up new eval outputs without restarting the server.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
workspace: Path,
|
||||
skill_name: str,
|
||||
feedback_path: Path,
|
||||
previous: dict[str, dict],
|
||||
benchmark_path: Path | None,
|
||||
*args,
|
||||
**kwargs,
|
||||
):
|
||||
self.workspace = workspace
|
||||
self.skill_name = skill_name
|
||||
self.feedback_path = feedback_path
|
||||
self.previous = previous
|
||||
self.benchmark_path = benchmark_path
|
||||
super().__init__(*args, **kwargs)
|
||||
|
||||
def do_GET(self) -> None:
|
||||
if self.path == "/" or self.path == "/index.html":
|
||||
# Regenerate HTML on each request (re-scans workspace for new outputs)
|
||||
runs = find_runs(self.workspace)
|
||||
benchmark = None
|
||||
if self.benchmark_path and self.benchmark_path.exists():
|
||||
try:
|
||||
benchmark = json.loads(self.benchmark_path.read_text())
|
||||
except (json.JSONDecodeError, OSError):
|
||||
pass
|
||||
html = generate_html(runs, self.skill_name, self.previous, benchmark)
|
||||
content = html.encode("utf-8")
|
||||
self.send_response(200)
|
||||
self.send_header("Content-Type", "text/html; charset=utf-8")
|
||||
self.send_header("Content-Length", str(len(content)))
|
||||
self.end_headers()
|
||||
self.wfile.write(content)
|
||||
elif self.path == "/api/feedback":
|
||||
data = b"{}"
|
||||
if self.feedback_path.exists():
|
||||
data = self.feedback_path.read_bytes()
|
||||
self.send_response(200)
|
||||
self.send_header("Content-Type", "application/json")
|
||||
self.send_header("Content-Length", str(len(data)))
|
||||
self.end_headers()
|
||||
self.wfile.write(data)
|
||||
else:
|
||||
self.send_error(404)
|
||||
|
||||
def do_POST(self) -> None:
|
||||
if self.path == "/api/feedback":
|
||||
length = int(self.headers.get("Content-Length", 0))
|
||||
body = self.rfile.read(length)
|
||||
try:
|
||||
data = json.loads(body)
|
||||
if not isinstance(data, dict) or "reviews" not in data:
|
||||
raise ValueError("Expected JSON object with 'reviews' key")
|
||||
self.feedback_path.write_text(json.dumps(data, indent=2) + "\n")
|
||||
resp = b'{"ok":true}'
|
||||
self.send_response(200)
|
||||
except (json.JSONDecodeError, OSError, ValueError) as e:
|
||||
resp = json.dumps({"error": str(e)}).encode()
|
||||
self.send_response(500)
|
||||
self.send_header("Content-Type", "application/json")
|
||||
self.send_header("Content-Length", str(len(resp)))
|
||||
self.end_headers()
|
||||
self.wfile.write(resp)
|
||||
else:
|
||||
self.send_error(404)
|
||||
|
||||
def log_message(self, format: str, *args: object) -> None:
|
||||
# Suppress request logging to keep terminal clean
|
||||
pass
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(description="Generate and serve eval review")
|
||||
parser.add_argument("workspace", type=Path, help="Path to workspace directory")
|
||||
parser.add_argument("--port", "-p", type=int, default=3117, help="Server port (default: 3117)")
|
||||
parser.add_argument("--skill-name", "-n", type=str, default=None, help="Skill name for header")
|
||||
parser.add_argument(
|
||||
"--previous-workspace", type=Path, default=None,
|
||||
help="Path to previous iteration's workspace (shows old outputs and feedback as context)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--benchmark", type=Path, default=None,
|
||||
help="Path to benchmark.json to show in the Benchmark tab",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--static", "-s", type=Path, default=None,
|
||||
help="Write standalone HTML to this path instead of starting a server",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
workspace = args.workspace.resolve()
|
||||
if not workspace.is_dir():
|
||||
print(f"Error: {workspace} is not a directory", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
runs = find_runs(workspace)
|
||||
if not runs:
|
||||
print(f"No runs found in {workspace}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
skill_name = args.skill_name or workspace.name.replace("-workspace", "")
|
||||
feedback_path = workspace / "feedback.json"
|
||||
|
||||
previous: dict[str, dict] = {}
|
||||
if args.previous_workspace:
|
||||
previous = load_previous_iteration(args.previous_workspace.resolve())
|
||||
|
||||
benchmark_path = args.benchmark.resolve() if args.benchmark else None
|
||||
benchmark = None
|
||||
if benchmark_path and benchmark_path.exists():
|
||||
try:
|
||||
benchmark = json.loads(benchmark_path.read_text())
|
||||
except (json.JSONDecodeError, OSError):
|
||||
pass
|
||||
|
||||
if args.static:
|
||||
html = generate_html(runs, skill_name, previous, benchmark)
|
||||
args.static.parent.mkdir(parents=True, exist_ok=True)
|
||||
args.static.write_text(html)
|
||||
print(f"\n Static viewer written to: {args.static}\n")
|
||||
sys.exit(0)
|
||||
|
||||
# Kill any existing process on the target port
|
||||
port = args.port
|
||||
_kill_port(port)
|
||||
handler = partial(ReviewHandler, workspace, skill_name, feedback_path, previous, benchmark_path)
|
||||
try:
|
||||
server = HTTPServer(("127.0.0.1", port), handler)
|
||||
except OSError:
|
||||
# Port still in use after kill attempt — find a free one
|
||||
server = HTTPServer(("127.0.0.1", 0), handler)
|
||||
port = server.server_address[1]
|
||||
|
||||
url = f"http://localhost:{port}"
|
||||
print(f"\n Eval Viewer")
|
||||
print(f" ─────────────────────────────────")
|
||||
print(f" URL: {url}")
|
||||
print(f" Workspace: {workspace}")
|
||||
print(f" Feedback: {feedback_path}")
|
||||
if previous:
|
||||
print(f" Previous: {args.previous_workspace} ({len(previous)} runs)")
|
||||
if benchmark_path:
|
||||
print(f" Benchmark: {benchmark_path}")
|
||||
print(f"\n Press Ctrl+C to stop.\n")
|
||||
|
||||
webbrowser.open(url)
|
||||
|
||||
try:
|
||||
server.serve_forever()
|
||||
except KeyboardInterrupt:
|
||||
print("\nStopped.")
|
||||
server.server_close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
1325
skills/public/skill-creator/eval-viewer/viewer.html
Normal file
File diff suppressed because it is too large
430
skills/public/skill-creator/references/schemas.md
Normal file
@@ -0,0 +1,430 @@
|
||||
# JSON Schemas
|
||||
|
||||
This document defines the JSON schemas used by skill-creator.
|
||||
|
||||
---
|
||||
|
||||
## evals.json
|
||||
|
||||
Defines the evals for a skill. Located at `evals/evals.json` within the skill directory.
|
||||
|
||||
```json
|
||||
{
|
||||
"skill_name": "example-skill",
|
||||
"evals": [
|
||||
{
|
||||
"id": 1,
|
||||
"prompt": "User's example prompt",
|
||||
"expected_output": "Description of expected result",
|
||||
"files": ["evals/files/sample1.pdf"],
|
||||
"expectations": [
|
||||
"The output includes X",
|
||||
"The skill used script Y"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Fields:**
|
||||
- `skill_name`: Name matching the skill's frontmatter
|
||||
- `evals[].id`: Unique integer identifier
|
||||
- `evals[].prompt`: The task to execute
|
||||
- `evals[].expected_output`: Human-readable description of success
|
||||
- `evals[].files`: Optional list of input file paths (relative to skill root)
|
||||
- `evals[].expectations`: List of verifiable statements
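For orientation, here is a minimal sketch (not part of the skill-creator scripts) of loading an `evals.json` and checking the fields above before running evals; the path handling and error messages are illustrative assumptions:

```python
import json
from pathlib import Path


def load_evals(skill_dir: Path) -> list[dict]:
    """Load evals/evals.json and verify the fields documented above."""
    data = json.loads((skill_dir / "evals" / "evals.json").read_text())
    evals = data.get("evals", [])
    for e in evals:
        missing = [k for k in ("id", "prompt", "expected_output", "expectations") if k not in e]
        if missing:
            raise ValueError(f"Eval {e.get('id', '?')} is missing fields: {missing}")
        # `files` is optional; paths are relative to the skill root
        for rel in e.get("files", []):
            if not (skill_dir / rel).exists():
                raise FileNotFoundError(f"Eval {e['id']} references a missing file: {rel}")
    return evals
```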
|
||||
|
||||
---
|
||||
|
||||
## history.json
|
||||
|
||||
Tracks version progression in Improve mode. Located at workspace root.
|
||||
|
||||
```json
|
||||
{
|
||||
"started_at": "2026-01-15T10:30:00Z",
|
||||
"skill_name": "pdf",
|
||||
"current_best": "v2",
|
||||
"iterations": [
|
||||
{
|
||||
"version": "v0",
|
||||
"parent": null,
|
||||
"expectation_pass_rate": 0.65,
|
||||
"grading_result": "baseline",
|
||||
"is_current_best": false
|
||||
},
|
||||
{
|
||||
"version": "v1",
|
||||
"parent": "v0",
|
||||
"expectation_pass_rate": 0.75,
|
||||
"grading_result": "won",
|
||||
"is_current_best": false
|
||||
},
|
||||
{
|
||||
"version": "v2",
|
||||
"parent": "v1",
|
||||
"expectation_pass_rate": 0.85,
|
||||
"grading_result": "won",
|
||||
"is_current_best": true
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Fields:**
|
||||
- `started_at`: ISO timestamp of when improvement started
|
||||
- `skill_name`: Name of the skill being improved
|
||||
- `current_best`: Version identifier of the best performer
|
||||
- `iterations[].version`: Version identifier (v0, v1, ...)
|
||||
- `iterations[].parent`: Parent version this was derived from
|
||||
- `iterations[].expectation_pass_rate`: Pass rate from grading
|
||||
- `iterations[].grading_result`: "baseline", "won", "lost", or "tie"
|
||||
- `iterations[].is_current_best`: Whether this is the current best version
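As a rough sketch of how this file might be maintained (the promotion rule, where a "won" result becomes the new `current_best`, is an assumption based on the example above rather than a documented contract):

```python
import json
from pathlib import Path


def record_iteration(history_path: Path, version: str, parent: str | None,
                     pass_rate: float, grading_result: str) -> None:
    """Append one iteration to history.json, promoting it if it won."""
    history = json.loads(history_path.read_text())
    entry = {
        "version": version,
        "parent": parent,
        "expectation_pass_rate": pass_rate,
        "grading_result": grading_result,  # "baseline", "won", "lost", or "tie"
        "is_current_best": False,
    }
    # A winning iteration (or the very first one) becomes the current best
    if grading_result == "won" or not history["iterations"]:
        for it in history["iterations"]:
            it["is_current_best"] = False
        entry["is_current_best"] = True
        history["current_best"] = version
    history["iterations"].append(entry)
    history_path.write_text(json.dumps(history, indent=2) + "\n")
```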
|
||||
|
||||
---
|
||||
|
||||
## grading.json
|
||||
|
||||
Output from the grader agent. Located at `<run-dir>/grading.json`.
|
||||
|
||||
```json
|
||||
{
|
||||
"expectations": [
|
||||
{
|
||||
"text": "The output includes the name 'John Smith'",
|
||||
"passed": true,
|
||||
"evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
|
||||
},
|
||||
{
|
||||
"text": "The spreadsheet has a SUM formula in cell B10",
|
||||
"passed": false,
|
||||
"evidence": "No spreadsheet was created. The output was a text file."
|
||||
}
|
||||
],
|
||||
"summary": {
|
||||
"passed": 2,
|
||||
"failed": 1,
|
||||
"total": 3,
|
||||
"pass_rate": 0.67
|
||||
},
|
||||
"execution_metrics": {
|
||||
"tool_calls": {
|
||||
"Read": 5,
|
||||
"Write": 2,
|
||||
"Bash": 8
|
||||
},
|
||||
"total_tool_calls": 15,
|
||||
"total_steps": 6,
|
||||
"errors_encountered": 0,
|
||||
"output_chars": 12450,
|
||||
"transcript_chars": 3200
|
||||
},
|
||||
"timing": {
|
||||
"executor_duration_seconds": 165.0,
|
||||
"grader_duration_seconds": 26.0,
|
||||
"total_duration_seconds": 191.0
|
||||
},
|
||||
"claims": [
|
||||
{
|
||||
"claim": "The form has 12 fillable fields",
|
||||
"type": "factual",
|
||||
"verified": true,
|
||||
"evidence": "Counted 12 fields in field_info.json"
|
||||
}
|
||||
],
|
||||
"user_notes_summary": {
|
||||
"uncertainties": ["Used 2023 data, may be stale"],
|
||||
"needs_review": [],
|
||||
"workarounds": ["Fell back to text overlay for non-fillable fields"]
|
||||
},
|
||||
"eval_feedback": {
|
||||
"suggestions": [
|
||||
{
|
||||
"assertion": "The output includes the name 'John Smith'",
|
||||
"reason": "A hallucinated document that mentions the name would also pass"
|
||||
}
|
||||
],
|
||||
"overall": "Assertions check presence but not correctness."
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Fields:**
|
||||
- `expectations[]`: Graded expectations with evidence
|
||||
- `summary`: Aggregate pass/fail counts
|
||||
- `execution_metrics`: Tool usage and output size (from executor's metrics.json)
|
||||
- `timing`: Wall clock timing (from timing.json)
|
||||
- `claims`: Extracted and verified claims from the output
|
||||
- `user_notes_summary`: Issues flagged by the executor
|
||||
- `eval_feedback`: (optional) Improvement suggestions for the evals, only present when the grader identifies issues worth raising
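The `summary` block is simply an aggregate of `expectations[]`; a minimal sketch of how it could be derived (the two-decimal rounding mirrors the example but is otherwise an assumption):

```python
def summarize_expectations(expectations: list[dict]) -> dict:
    """Build the grading.json `summary` block from graded expectations."""
    passed = sum(1 for e in expectations if e.get("passed"))
    total = len(expectations)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```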
|
||||
|
||||
---
|
||||
|
||||
## metrics.json
|
||||
|
||||
Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
|
||||
|
||||
```json
|
||||
{
|
||||
"tool_calls": {
|
||||
"Read": 5,
|
||||
"Write": 2,
|
||||
"Bash": 8,
|
||||
"Edit": 1,
|
||||
"Glob": 2,
|
||||
"Grep": 0
|
||||
},
|
||||
"total_tool_calls": 18,
|
||||
"total_steps": 6,
|
||||
"files_created": ["filled_form.pdf", "field_values.json"],
|
||||
"errors_encountered": 0,
|
||||
"output_chars": 12450,
|
||||
"transcript_chars": 3200
|
||||
}
|
||||
```
|
||||
|
||||
**Fields:**
|
||||
- `tool_calls`: Count per tool type
|
||||
- `total_tool_calls`: Sum of all tool calls
|
||||
- `total_steps`: Number of major execution steps
|
||||
- `files_created`: List of output files created
|
||||
- `errors_encountered`: Number of errors during execution
|
||||
- `output_chars`: Total character count of output files
|
||||
- `transcript_chars`: Character count of transcript
|
||||
|
||||
---
|
||||
|
||||
## timing.json
|
||||
|
||||
Wall clock timing for a run. Located at `<run-dir>/timing.json`.
|
||||
|
||||
**How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately — they are not persisted anywhere else and cannot be recovered after the fact.
|
||||
|
||||
```json
|
||||
{
|
||||
"total_tokens": 84852,
|
||||
"duration_ms": 23332,
|
||||
"total_duration_seconds": 23.3,
|
||||
"executor_start": "2026-01-15T10:30:00Z",
|
||||
"executor_end": "2026-01-15T10:32:45Z",
|
||||
"executor_duration_seconds": 165.0,
|
||||
"grader_start": "2026-01-15T10:32:46Z",
|
||||
"grader_end": "2026-01-15T10:33:12Z",
|
||||
"grader_duration_seconds": 26.0
|
||||
}
|
||||
```
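A minimal sketch of capturing those values the moment the notification arrives (the executor/grader start, end, and duration fields can be merged into the same file afterwards; the function name and arguments are illustrative):

```python
import json
from pathlib import Path


def save_timing(run_dir: Path, total_tokens: int, duration_ms: int) -> None:
    """Persist the subagent notification values before they are lost."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    (run_dir / "timing.json").write_text(json.dumps(timing, indent=2) + "\n")
```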
|
||||
|
||||
---
|
||||
|
||||
## benchmark.json
|
||||
|
||||
Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
|
||||
|
||||
```json
|
||||
{
|
||||
"metadata": {
|
||||
"skill_name": "pdf",
|
||||
"skill_path": "/path/to/pdf",
|
||||
"executor_model": "claude-sonnet-4-20250514",
|
||||
"analyzer_model": "most-capable-model",
|
||||
"timestamp": "2026-01-15T10:30:00Z",
|
||||
"evals_run": [1, 2, 3],
|
||||
"runs_per_configuration": 3
|
||||
},
|
||||
|
||||
"runs": [
|
||||
{
|
||||
"eval_id": 1,
|
||||
"eval_name": "Ocean",
|
||||
"configuration": "with_skill",
|
||||
"run_number": 1,
|
||||
"result": {
|
||||
"pass_rate": 0.85,
|
||||
"passed": 6,
|
||||
"failed": 1,
|
||||
"total": 7,
|
||||
"time_seconds": 42.5,
|
||||
"tokens": 3800,
|
||||
"tool_calls": 18,
|
||||
"errors": 0
|
||||
},
|
||||
"expectations": [
|
||||
{"text": "...", "passed": true, "evidence": "..."}
|
||||
],
|
||||
"notes": [
|
||||
"Used 2023 data, may be stale",
|
||||
"Fell back to text overlay for non-fillable fields"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
||||
"run_summary": {
|
||||
"with_skill": {
|
||||
"pass_rate": {"mean": 0.85, "stddev": 0.05, "min": 0.80, "max": 0.90},
|
||||
"time_seconds": {"mean": 45.0, "stddev": 12.0, "min": 32.0, "max": 58.0},
|
||||
"tokens": {"mean": 3800, "stddev": 400, "min": 3200, "max": 4100}
|
||||
},
|
||||
"without_skill": {
|
||||
"pass_rate": {"mean": 0.35, "stddev": 0.08, "min": 0.28, "max": 0.45},
|
||||
"time_seconds": {"mean": 32.0, "stddev": 8.0, "min": 24.0, "max": 42.0},
|
||||
"tokens": {"mean": 2100, "stddev": 300, "min": 1800, "max": 2500}
|
||||
},
|
||||
"delta": {
|
||||
"pass_rate": "+0.50",
|
||||
"time_seconds": "+13.0",
|
||||
"tokens": "+1700"
|
||||
}
|
||||
},
|
||||
|
||||
"notes": [
|
||||
"Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
|
||||
"Eval 3 shows high variance (50% ± 40%) - may be flaky or model-dependent",
|
||||
"Without-skill runs consistently fail on table extraction expectations",
|
||||
"Skill adds 13s average execution time but improves pass rate by 50%"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Fields:**
|
||||
- `metadata`: Information about the benchmark run
|
||||
- `skill_name`: Name of the skill
|
||||
- `timestamp`: When the benchmark was run
|
||||
- `evals_run`: List of eval names or IDs
|
||||
- `runs_per_configuration`: Number of runs per config (e.g. 3)
|
||||
- `runs[]`: Individual run results
|
||||
- `eval_id`: Numeric eval identifier
|
||||
- `eval_name`: Human-readable eval name (used as section header in the viewer)
|
||||
- `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
|
||||
- `run_number`: Integer run number (1, 2, 3...)
|
||||
- `result`: Nested object with `pass_rate`, `passed`, `total`, `time_seconds`, `tokens`, `errors`
|
||||
- `run_summary`: Statistical aggregates per configuration
|
||||
- `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean` and `stddev` fields
|
||||
- `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"`
|
||||
- `notes`: Freeform observations from the analyzer
|
||||
|
||||
**Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.
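A quick shape check one could run before opening the viewer (the messages are illustrative; the field names are the ones documented above):

```python
def check_benchmark_shape(benchmark: dict) -> list[str]:
    """Return shape problems that would make the viewer show empty or zero values."""
    problems = []
    for i, run in enumerate(benchmark.get("runs", [])):
        if run.get("configuration") not in ("with_skill", "without_skill"):
            problems.append(f"runs[{i}]: 'configuration' must be 'with_skill' or 'without_skill'")
        if "pass_rate" not in run.get("result", {}):
            problems.append(f"runs[{i}]: 'pass_rate' must be nested under 'result'")
    for cfg in ("with_skill", "without_skill"):
        if cfg not in benchmark.get("run_summary", {}):
            problems.append(f"run_summary is missing the '{cfg}' block")
    return problems
```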
|
||||
|
||||
---
|
||||
|
||||
## comparison.json
|
||||
|
||||
Output from blind comparator. Located at `<grading-dir>/comparison-N.json`.
|
||||
|
||||
```json
|
||||
{
|
||||
"winner": "A",
|
||||
"reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
|
||||
"rubric": {
|
||||
"A": {
|
||||
"content": {
|
||||
"correctness": 5,
|
||||
"completeness": 5,
|
||||
"accuracy": 4
|
||||
},
|
||||
"structure": {
|
||||
"organization": 4,
|
||||
"formatting": 5,
|
||||
"usability": 4
|
||||
},
|
||||
"content_score": 4.7,
|
||||
"structure_score": 4.3,
|
||||
"overall_score": 9.0
|
||||
},
|
||||
"B": {
|
||||
"content": {
|
||||
"correctness": 3,
|
||||
"completeness": 2,
|
||||
"accuracy": 3
|
||||
},
|
||||
"structure": {
|
||||
"organization": 3,
|
||||
"formatting": 2,
|
||||
"usability": 3
|
||||
},
|
||||
"content_score": 2.7,
|
||||
"structure_score": 2.7,
|
||||
"overall_score": 5.4
|
||||
}
|
||||
},
|
||||
"output_quality": {
|
||||
"A": {
|
||||
"score": 9,
|
||||
"strengths": ["Complete solution", "Well-formatted", "All fields present"],
|
||||
"weaknesses": ["Minor style inconsistency in header"]
|
||||
},
|
||||
"B": {
|
||||
"score": 5,
|
||||
"strengths": ["Readable output", "Correct basic structure"],
|
||||
"weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
|
||||
}
|
||||
},
|
||||
"expectation_results": {
|
||||
"A": {
|
||||
"passed": 4,
|
||||
"total": 5,
|
||||
"pass_rate": 0.80,
|
||||
"details": [
|
||||
{"text": "Output includes name", "passed": true}
|
||||
]
|
||||
},
|
||||
"B": {
|
||||
"passed": 3,
|
||||
"total": 5,
|
||||
"pass_rate": 0.60,
|
||||
"details": [
|
||||
{"text": "Output includes name", "passed": true}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## analysis.json
|
||||
|
||||
Output from post-hoc analyzer. Located at `<grading-dir>/analysis.json`.
|
||||
|
||||
```json
|
||||
{
|
||||
"comparison_summary": {
|
||||
"winner": "A",
|
||||
"winner_skill": "path/to/winner/skill",
|
||||
"loser_skill": "path/to/loser/skill",
|
||||
"comparator_reasoning": "Brief summary of why comparator chose winner"
|
||||
},
|
||||
"winner_strengths": [
|
||||
"Clear step-by-step instructions for handling multi-page documents",
|
||||
"Included validation script that caught formatting errors"
|
||||
],
|
||||
"loser_weaknesses": [
|
||||
"Vague instruction 'process the document appropriately' led to inconsistent behavior",
|
||||
"No script for validation, agent had to improvise"
|
||||
],
|
||||
"instruction_following": {
|
||||
"winner": {
|
||||
"score": 9,
|
||||
"issues": ["Minor: skipped optional logging step"]
|
||||
},
|
||||
"loser": {
|
||||
"score": 6,
|
||||
"issues": [
|
||||
"Did not use the skill's formatting template",
|
||||
"Invented own approach instead of following step 3"
|
||||
]
|
||||
}
|
||||
},
|
||||
"improvement_suggestions": [
|
||||
{
|
||||
"priority": "high",
|
||||
"category": "instructions",
|
||||
"suggestion": "Replace 'process the document appropriately' with explicit steps",
|
||||
"expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
|
||||
}
|
||||
],
|
||||
"transcript_insights": {
|
||||
"winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script",
|
||||
"loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods"
|
||||
}
|
||||
}
|
||||
```
|
||||
401
skills/public/skill-creator/scripts/aggregate_benchmark.py
Executable file
@@ -0,0 +1,401 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Aggregate individual run results into benchmark summary statistics.
|
||||
|
||||
Reads grading.json files from run directories and produces:
|
||||
- run_summary with mean, stddev, min, max for each metric
|
||||
- delta between with_skill and without_skill configurations
|
||||
|
||||
Usage:
|
||||
python aggregate_benchmark.py <benchmark_dir>
|
||||
|
||||
Example:
|
||||
python aggregate_benchmark.py benchmarks/2026-01-15T10-30-00/
|
||||
|
||||
The script supports two directory layouts:
|
||||
|
||||
Workspace layout (from skill-creator iterations):
|
||||
<benchmark_dir>/
|
||||
└── eval-N/
|
||||
├── with_skill/
|
||||
│ ├── run-1/grading.json
|
||||
│ └── run-2/grading.json
|
||||
└── without_skill/
|
||||
├── run-1/grading.json
|
||||
└── run-2/grading.json
|
||||
|
||||
Legacy layout (with runs/ subdirectory):
|
||||
<benchmark_dir>/
|
||||
└── runs/
|
||||
└── eval-N/
|
||||
├── with_skill/
|
||||
│ └── run-1/grading.json
|
||||
└── without_skill/
|
||||
└── run-1/grading.json
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import math
|
||||
import sys
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def calculate_stats(values: list[float]) -> dict:
|
||||
"""Calculate mean, stddev, min, max for a list of values."""
|
||||
if not values:
|
||||
return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}
|
||||
|
||||
n = len(values)
|
||||
mean = sum(values) / n
|
||||
|
||||
if n > 1:
|
||||
variance = sum((x - mean) ** 2 for x in values) / (n - 1)
|
||||
stddev = math.sqrt(variance)
|
||||
else:
|
||||
stddev = 0.0
|
||||
|
||||
return {
|
||||
"mean": round(mean, 4),
|
||||
"stddev": round(stddev, 4),
|
||||
"min": round(min(values), 4),
|
||||
"max": round(max(values), 4)
|
||||
}
|
||||
|
||||
|
||||
def load_run_results(benchmark_dir: Path) -> dict:
|
||||
"""
|
||||
Load all run results from a benchmark directory.
|
||||
|
||||
Returns dict keyed by config name (e.g. "with_skill"/"without_skill",
|
||||
or "new_skill"/"old_skill"), each containing a list of run results.
|
||||
"""
|
||||
# Support both layouts: eval dirs directly under benchmark_dir, or under runs/
|
||||
runs_dir = benchmark_dir / "runs"
|
||||
if runs_dir.exists():
|
||||
search_dir = runs_dir
|
||||
elif list(benchmark_dir.glob("eval-*")):
|
||||
search_dir = benchmark_dir
|
||||
else:
|
||||
print(f"No eval directories found in {benchmark_dir} or {benchmark_dir / 'runs'}")
|
||||
return {}
|
||||
|
||||
results: dict[str, list] = {}
|
||||
|
||||
for eval_idx, eval_dir in enumerate(sorted(search_dir.glob("eval-*"))):
|
||||
metadata_path = eval_dir / "eval_metadata.json"
|
||||
if metadata_path.exists():
|
||||
try:
|
||||
with open(metadata_path) as mf:
|
||||
eval_id = json.load(mf).get("eval_id", eval_idx)
|
||||
except (json.JSONDecodeError, OSError):
|
||||
eval_id = eval_idx
|
||||
else:
|
||||
try:
|
||||
eval_id = int(eval_dir.name.split("-")[1])
|
||||
except ValueError:
|
||||
eval_id = eval_idx
|
||||
|
||||
# Discover config directories dynamically rather than hardcoding names
|
||||
for config_dir in sorted(eval_dir.iterdir()):
|
||||
if not config_dir.is_dir():
|
||||
continue
|
||||
# Skip non-config directories (inputs, outputs, etc.)
|
||||
if not list(config_dir.glob("run-*")):
|
||||
continue
|
||||
config = config_dir.name
|
||||
if config not in results:
|
||||
results[config] = []
|
||||
|
||||
for run_dir in sorted(config_dir.glob("run-*")):
|
||||
run_number = int(run_dir.name.split("-")[1])
|
||||
grading_file = run_dir / "grading.json"
|
||||
|
||||
if not grading_file.exists():
|
||||
print(f"Warning: grading.json not found in {run_dir}")
|
||||
continue
|
||||
|
||||
try:
|
||||
with open(grading_file) as f:
|
||||
grading = json.load(f)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Warning: Invalid JSON in {grading_file}: {e}")
|
||||
continue
|
||||
|
||||
# Extract metrics
|
||||
result = {
|
||||
"eval_id": eval_id,
|
||||
"run_number": run_number,
|
||||
"pass_rate": grading.get("summary", {}).get("pass_rate", 0.0),
|
||||
"passed": grading.get("summary", {}).get("passed", 0),
|
||||
"failed": grading.get("summary", {}).get("failed", 0),
|
||||
"total": grading.get("summary", {}).get("total", 0),
|
||||
}
|
||||
|
||||
# Extract timing — check grading.json first, then sibling timing.json
|
||||
timing = grading.get("timing", {})
|
||||
result["time_seconds"] = timing.get("total_duration_seconds", 0.0)
|
||||
timing_file = run_dir / "timing.json"
|
||||
if result["time_seconds"] == 0.0 and timing_file.exists():
|
||||
try:
|
||||
with open(timing_file) as tf:
|
||||
timing_data = json.load(tf)
|
||||
result["time_seconds"] = timing_data.get("total_duration_seconds", 0.0)
|
||||
result["tokens"] = timing_data.get("total_tokens", 0)
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
# Extract metrics if available
|
||||
metrics = grading.get("execution_metrics", {})
|
||||
result["tool_calls"] = metrics.get("total_tool_calls", 0)
|
||||
if not result.get("tokens"):
|
||||
result["tokens"] = metrics.get("output_chars", 0)
|
||||
result["errors"] = metrics.get("errors_encountered", 0)
|
||||
|
||||
# Extract expectations — viewer requires fields: text, passed, evidence
|
||||
raw_expectations = grading.get("expectations", [])
|
||||
for exp in raw_expectations:
|
||||
if "text" not in exp or "passed" not in exp:
|
||||
print(f"Warning: expectation in {grading_file} missing required fields (text, passed, evidence): {exp}")
|
||||
result["expectations"] = raw_expectations
|
||||
|
||||
# Extract notes from user_notes_summary
|
||||
notes_summary = grading.get("user_notes_summary", {})
|
||||
notes = []
|
||||
notes.extend(notes_summary.get("uncertainties", []))
|
||||
notes.extend(notes_summary.get("needs_review", []))
|
||||
notes.extend(notes_summary.get("workarounds", []))
|
||||
result["notes"] = notes
|
||||
|
||||
results[config].append(result)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def aggregate_results(results: dict) -> dict:
|
||||
"""
|
||||
Aggregate run results into summary statistics.
|
||||
|
||||
Returns run_summary with stats for each configuration and delta.
|
||||
"""
|
||||
run_summary = {}
|
||||
configs = list(results.keys())
|
||||
|
||||
for config in configs:
|
||||
runs = results.get(config, [])
|
||||
|
||||
if not runs:
|
||||
run_summary[config] = {
|
||||
"pass_rate": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
|
||||
"time_seconds": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
|
||||
"tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0}
|
||||
}
|
||||
continue
|
||||
|
||||
pass_rates = [r["pass_rate"] for r in runs]
|
||||
times = [r["time_seconds"] for r in runs]
|
||||
tokens = [r.get("tokens", 0) for r in runs]
|
||||
|
||||
run_summary[config] = {
|
||||
"pass_rate": calculate_stats(pass_rates),
|
||||
"time_seconds": calculate_stats(times),
|
||||
"tokens": calculate_stats(tokens)
|
||||
}
|
||||
|
||||
# Calculate delta between the first two configs (if two exist)
|
||||
if len(configs) >= 2:
|
||||
primary = run_summary.get(configs[0], {})
|
||||
baseline = run_summary.get(configs[1], {})
|
||||
else:
|
||||
primary = run_summary.get(configs[0], {}) if configs else {}
|
||||
baseline = {}
|
||||
|
||||
delta_pass_rate = primary.get("pass_rate", {}).get("mean", 0) - baseline.get("pass_rate", {}).get("mean", 0)
|
||||
delta_time = primary.get("time_seconds", {}).get("mean", 0) - baseline.get("time_seconds", {}).get("mean", 0)
|
||||
delta_tokens = primary.get("tokens", {}).get("mean", 0) - baseline.get("tokens", {}).get("mean", 0)
|
||||
|
||||
run_summary["delta"] = {
|
||||
"pass_rate": f"{delta_pass_rate:+.2f}",
|
||||
"time_seconds": f"{delta_time:+.1f}",
|
||||
"tokens": f"{delta_tokens:+.0f}"
|
||||
}
|
||||
|
||||
return run_summary
|
||||
|
||||
|
||||
def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: str = "") -> dict:
|
||||
"""
|
||||
Generate complete benchmark.json from run results.
|
||||
"""
|
||||
results = load_run_results(benchmark_dir)
|
||||
run_summary = aggregate_results(results)
|
||||
|
||||
# Build runs array for benchmark.json
|
||||
runs = []
|
||||
for config in results:
|
||||
for result in results[config]:
|
||||
runs.append({
|
||||
"eval_id": result["eval_id"],
|
||||
"configuration": config,
|
||||
"run_number": result["run_number"],
|
||||
"result": {
|
||||
"pass_rate": result["pass_rate"],
|
||||
"passed": result["passed"],
|
||||
"failed": result["failed"],
|
||||
"total": result["total"],
|
||||
"time_seconds": result["time_seconds"],
|
||||
"tokens": result.get("tokens", 0),
|
||||
"tool_calls": result.get("tool_calls", 0),
|
||||
"errors": result.get("errors", 0)
|
||||
},
|
||||
"expectations": result["expectations"],
|
||||
"notes": result["notes"]
|
||||
})
|
||||
|
||||
# Determine eval IDs from results
|
||||
eval_ids = sorted(set(
|
||||
r["eval_id"]
|
||||
for config in results.values()
|
||||
for r in config
|
||||
))
|
||||
|
||||
benchmark = {
|
||||
"metadata": {
|
||||
"skill_name": skill_name or "<skill-name>",
|
||||
"skill_path": skill_path or "<path/to/skill>",
|
||||
"executor_model": "<model-name>",
|
||||
"analyzer_model": "<model-name>",
|
||||
"timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
|
||||
"evals_run": eval_ids,
|
||||
"runs_per_configuration": 3
|
||||
},
|
||||
"runs": runs,
|
||||
"run_summary": run_summary,
|
||||
"notes": [] # To be filled by analyzer
|
||||
}
|
||||
|
||||
return benchmark
|
||||
|
||||
|
||||
def generate_markdown(benchmark: dict) -> str:
|
||||
"""Generate human-readable benchmark.md from benchmark data."""
|
||||
metadata = benchmark["metadata"]
|
||||
run_summary = benchmark["run_summary"]
|
||||
|
||||
# Determine config names (excluding "delta")
|
||||
configs = [k for k in run_summary if k != "delta"]
|
||||
config_a = configs[0] if len(configs) >= 1 else "config_a"
|
||||
config_b = configs[1] if len(configs) >= 2 else "config_b"
|
||||
label_a = config_a.replace("_", " ").title()
|
||||
label_b = config_b.replace("_", " ").title()
|
||||
|
||||
lines = [
|
||||
f"# Skill Benchmark: {metadata['skill_name']}",
|
||||
"",
|
||||
f"**Model**: {metadata['executor_model']}",
|
||||
f"**Date**: {metadata['timestamp']}",
|
||||
f"**Evals**: {', '.join(map(str, metadata['evals_run']))} ({metadata['runs_per_configuration']} runs each per configuration)",
|
||||
"",
|
||||
"## Summary",
|
||||
"",
|
||||
f"| Metric | {label_a} | {label_b} | Delta |",
|
||||
"|--------|------------|---------------|-------|",
|
||||
]
|
||||
|
||||
a_summary = run_summary.get(config_a, {})
|
||||
b_summary = run_summary.get(config_b, {})
|
||||
delta = run_summary.get("delta", {})
|
||||
|
||||
# Format pass rate
|
||||
a_pr = a_summary.get("pass_rate", {})
|
||||
b_pr = b_summary.get("pass_rate", {})
|
||||
lines.append(f"| Pass Rate | {a_pr.get('mean', 0)*100:.0f}% ± {a_pr.get('stddev', 0)*100:.0f}% | {b_pr.get('mean', 0)*100:.0f}% ± {b_pr.get('stddev', 0)*100:.0f}% | {delta.get('pass_rate', '—')} |")
|
||||
|
||||
# Format time
|
||||
a_time = a_summary.get("time_seconds", {})
|
||||
b_time = b_summary.get("time_seconds", {})
|
||||
lines.append(f"| Time | {a_time.get('mean', 0):.1f}s ± {a_time.get('stddev', 0):.1f}s | {b_time.get('mean', 0):.1f}s ± {b_time.get('stddev', 0):.1f}s | {delta.get('time_seconds', '—')}s |")
|
||||
|
||||
# Format tokens
|
||||
a_tokens = a_summary.get("tokens", {})
|
||||
b_tokens = b_summary.get("tokens", {})
|
||||
lines.append(f"| Tokens | {a_tokens.get('mean', 0):.0f} ± {a_tokens.get('stddev', 0):.0f} | {b_tokens.get('mean', 0):.0f} ± {b_tokens.get('stddev', 0):.0f} | {delta.get('tokens', '—')} |")
|
||||
|
||||
# Notes section
|
||||
if benchmark.get("notes"):
|
||||
lines.extend([
|
||||
"",
|
||||
"## Notes",
|
||||
""
|
||||
])
|
||||
for note in benchmark["notes"]:
|
||||
lines.append(f"- {note}")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Aggregate benchmark run results into summary statistics"
|
||||
)
|
||||
parser.add_argument(
|
||||
"benchmark_dir",
|
||||
type=Path,
|
||||
help="Path to the benchmark directory"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--skill-name",
|
||||
default="",
|
||||
help="Name of the skill being benchmarked"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--skill-path",
|
||||
default="",
|
||||
help="Path to the skill being benchmarked"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output", "-o",
|
||||
type=Path,
|
||||
help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.benchmark_dir.exists():
|
||||
print(f"Directory not found: {args.benchmark_dir}")
|
||||
sys.exit(1)
|
||||
|
||||
# Generate benchmark
|
||||
benchmark = generate_benchmark(args.benchmark_dir, args.skill_name, args.skill_path)
|
||||
|
||||
# Determine output paths
|
||||
output_json = args.output or (args.benchmark_dir / "benchmark.json")
|
||||
output_md = output_json.with_suffix(".md")
|
||||
|
||||
# Write benchmark.json
|
||||
with open(output_json, "w") as f:
|
||||
json.dump(benchmark, f, indent=2)
|
||||
print(f"Generated: {output_json}")
|
||||
|
||||
# Write benchmark.md
|
||||
markdown = generate_markdown(benchmark)
|
||||
with open(output_md, "w") as f:
|
||||
f.write(markdown)
|
||||
print(f"Generated: {output_md}")
|
||||
|
||||
# Print summary
|
||||
run_summary = benchmark["run_summary"]
|
||||
configs = [k for k in run_summary if k != "delta"]
|
||||
delta = run_summary.get("delta", {})
|
||||
|
||||
print(f"\nSummary:")
|
||||
for config in configs:
|
||||
pr = run_summary[config]["pass_rate"]["mean"]
|
||||
label = config.replace("_", " ").title()
|
||||
print(f" {label}: {pr*100:.1f}% pass rate")
|
||||
print(f" Delta: {delta.get('pass_rate', '—')}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
326
skills/public/skill-creator/scripts/generate_report.py
Executable file
@@ -0,0 +1,326 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate an HTML report from run_loop.py output.
|
||||
|
||||
Takes the JSON output from run_loop.py and generates a visual HTML report
|
||||
showing each description attempt with check/x for each test case.
|
||||
Distinguishes between train and test queries.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import html
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "") -> str:
|
||||
"""Generate HTML report from loop output data. If auto_refresh is True, adds a meta refresh tag."""
|
||||
history = data.get("history", [])
|
||||
holdout = data.get("holdout", 0)
|
||||
title_prefix = html.escape(skill_name + " \u2014 ") if skill_name else ""
|
||||
|
||||
# Get all unique queries from train and test sets, with should_trigger info
|
||||
train_queries: list[dict] = []
|
||||
test_queries: list[dict] = []
|
||||
if history:
|
||||
for r in history[0].get("train_results", history[0].get("results", [])):
|
||||
train_queries.append({"query": r["query"], "should_trigger": r.get("should_trigger", True)})
|
||||
if history[0].get("test_results"):
|
||||
for r in history[0].get("test_results", []):
|
||||
test_queries.append({"query": r["query"], "should_trigger": r.get("should_trigger", True)})
|
||||
|
||||
refresh_tag = ' <meta http-equiv="refresh" content="5">\n' if auto_refresh else ""
|
||||
|
||||
html_parts = ["""<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
""" + refresh_tag + """ <title>""" + title_prefix + """Skill Description Optimization</title>
|
||||
<link rel="preconnect" href="https://fonts.googleapis.com">
|
||||
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
||||
<link href="https://fonts.googleapis.com/css2?family=Poppins:wght@500;600&family=Lora:wght@400;500&display=swap" rel="stylesheet">
|
||||
<style>
|
||||
body {
|
||||
font-family: 'Lora', Georgia, serif;
|
||||
max-width: 100%;
|
||||
margin: 0 auto;
|
||||
padding: 20px;
|
||||
background: #faf9f5;
|
||||
color: #141413;
|
||||
}
|
||||
h1 { font-family: 'Poppins', sans-serif; color: #141413; }
|
||||
.explainer {
|
||||
background: white;
|
||||
padding: 15px;
|
||||
border-radius: 6px;
|
||||
margin-bottom: 20px;
|
||||
border: 1px solid #e8e6dc;
|
||||
color: #b0aea5;
|
||||
font-size: 0.875rem;
|
||||
line-height: 1.6;
|
||||
}
|
||||
.summary {
|
||||
background: white;
|
||||
padding: 15px;
|
||||
border-radius: 6px;
|
||||
margin-bottom: 20px;
|
||||
border: 1px solid #e8e6dc;
|
||||
}
|
||||
.summary p { margin: 5px 0; }
|
||||
.best { color: #788c5d; font-weight: bold; }
|
||||
.table-container {
|
||||
overflow-x: auto;
|
||||
width: 100%;
|
||||
}
|
||||
table {
|
||||
border-collapse: collapse;
|
||||
background: white;
|
||||
border: 1px solid #e8e6dc;
|
||||
border-radius: 6px;
|
||||
font-size: 12px;
|
||||
min-width: 100%;
|
||||
}
|
||||
th, td {
|
||||
padding: 8px;
|
||||
text-align: left;
|
||||
border: 1px solid #e8e6dc;
|
||||
white-space: normal;
|
||||
word-wrap: break-word;
|
||||
}
|
||||
th {
|
||||
font-family: 'Poppins', sans-serif;
|
||||
background: #141413;
|
||||
color: #faf9f5;
|
||||
font-weight: 500;
|
||||
}
|
||||
th.test-col {
|
||||
background: #6a9bcc;
|
||||
}
|
||||
th.query-col { min-width: 200px; }
|
||||
td.description {
|
||||
font-family: monospace;
|
||||
font-size: 11px;
|
||||
word-wrap: break-word;
|
||||
max-width: 400px;
|
||||
}
|
||||
td.result {
|
||||
text-align: center;
|
||||
font-size: 16px;
|
||||
min-width: 40px;
|
||||
}
|
||||
td.test-result {
|
||||
background: #f0f6fc;
|
||||
}
|
||||
.pass { color: #788c5d; }
|
||||
.fail { color: #c44; }
|
||||
.rate {
|
||||
font-size: 9px;
|
||||
color: #b0aea5;
|
||||
display: block;
|
||||
}
|
||||
tr:hover { background: #faf9f5; }
|
||||
.score {
|
||||
display: inline-block;
|
||||
padding: 2px 6px;
|
||||
border-radius: 4px;
|
||||
font-weight: bold;
|
||||
font-size: 11px;
|
||||
}
|
||||
.score-good { background: #eef2e8; color: #788c5d; }
|
||||
.score-ok { background: #fef3c7; color: #d97706; }
|
||||
.score-bad { background: #fceaea; color: #c44; }
|
||||
.train-label { color: #b0aea5; font-size: 10px; }
|
||||
.test-label { color: #6a9bcc; font-size: 10px; font-weight: bold; }
|
||||
.best-row { background: #f5f8f2; }
|
||||
th.positive-col { border-bottom: 3px solid #788c5d; }
|
||||
th.negative-col { border-bottom: 3px solid #c44; }
|
||||
th.test-col.positive-col { border-bottom: 3px solid #788c5d; }
|
||||
th.test-col.negative-col { border-bottom: 3px solid #c44; }
|
||||
.legend { font-family: 'Poppins', sans-serif; display: flex; gap: 20px; margin-bottom: 10px; font-size: 13px; align-items: center; }
|
||||
.legend-item { display: flex; align-items: center; gap: 6px; }
|
||||
.legend-swatch { width: 16px; height: 16px; border-radius: 3px; display: inline-block; }
|
||||
.swatch-positive { background: #141413; border-bottom: 3px solid #788c5d; }
|
||||
.swatch-negative { background: #141413; border-bottom: 3px solid #c44; }
|
||||
.swatch-test { background: #6a9bcc; }
|
||||
.swatch-train { background: #141413; }
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<h1>""" + title_prefix + """Skill Description Optimization</h1>
|
||||
<div class="explainer">
|
||||
<strong>Optimizing your skill's description.</strong> This page updates automatically as Claude tests different versions of your skill's description. Each row is an iteration — a new description attempt. The columns show test queries: green checkmarks mean the skill triggered correctly (or correctly didn't trigger), red crosses mean it got it wrong. The "Train" score shows performance on queries used to improve the description; the "Test" score shows performance on held-out queries the optimizer hasn't seen. When it's done, Claude will apply the best-performing description to your skill.
|
||||
</div>
|
||||
"""]
|
||||
|
||||
# Summary section
|
||||
best_test_score = data.get('best_test_score')
|
||||
best_train_score = data.get('best_train_score')
|
||||
html_parts.append(f"""
|
||||
<div class="summary">
|
||||
<p><strong>Original:</strong> {html.escape(data.get('original_description', 'N/A'))}</p>
|
||||
<p class="best"><strong>Best:</strong> {html.escape(data.get('best_description', 'N/A'))}</p>
|
||||
<p><strong>Best Score:</strong> {data.get('best_score', 'N/A')} {'(test)' if best_test_score else '(train)'}</p>
|
||||
<p><strong>Iterations:</strong> {data.get('iterations_run', 0)} | <strong>Train:</strong> {data.get('train_size', '?')} | <strong>Test:</strong> {data.get('test_size', '?')}</p>
|
||||
</div>
|
||||
""")
|
||||
|
||||
# Legend
|
||||
html_parts.append("""
|
||||
<div class="legend">
|
||||
<span style="font-weight:600">Query columns:</span>
|
||||
<span class="legend-item"><span class="legend-swatch swatch-positive"></span> Should trigger</span>
|
||||
<span class="legend-item"><span class="legend-swatch swatch-negative"></span> Should NOT trigger</span>
|
||||
<span class="legend-item"><span class="legend-swatch swatch-train"></span> Train</span>
|
||||
<span class="legend-item"><span class="legend-swatch swatch-test"></span> Test</span>
|
||||
</div>
|
||||
""")
|
||||
|
||||
# Table header
|
||||
html_parts.append("""
|
||||
<div class="table-container">
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Iter</th>
|
||||
<th>Train</th>
|
||||
<th>Test</th>
|
||||
<th class="query-col">Description</th>
|
||||
""")
|
||||
|
||||
# Add column headers for train queries
|
||||
for qinfo in train_queries:
|
||||
polarity = "positive-col" if qinfo["should_trigger"] else "negative-col"
|
||||
html_parts.append(f' <th class="{polarity}">{html.escape(qinfo["query"])}</th>\n')
|
||||
|
||||
# Add column headers for test queries (different color)
|
||||
for qinfo in test_queries:
|
||||
polarity = "positive-col" if qinfo["should_trigger"] else "negative-col"
|
||||
html_parts.append(f' <th class="test-col {polarity}">{html.escape(qinfo["query"])}</th>\n')
|
||||
|
||||
html_parts.append(""" </tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
""")
|
||||
|
||||
# Find best iteration for highlighting
|
||||
# Guard: max() on an empty history would raise ValueError
if not history:
best_iter = None
elif test_queries:
best_iter = max(history, key=lambda h: h.get("test_passed") or 0).get("iteration")
else:
best_iter = max(history, key=lambda h: h.get("train_passed", h.get("passed", 0))).get("iteration")
|
||||
|
||||
# Add rows for each iteration
|
||||
for h in history:
|
||||
iteration = h.get("iteration", "?")
|
||||
train_passed = h.get("train_passed", h.get("passed", 0))
|
||||
train_total = h.get("train_total", h.get("total", 0))
|
||||
test_passed = h.get("test_passed")
|
||||
test_total = h.get("test_total")
|
||||
description = h.get("description", "")
|
||||
train_results = h.get("train_results", h.get("results", []))
|
||||
test_results = h.get("test_results", [])
|
||||
|
||||
# Create lookups for results by query
|
||||
train_by_query = {r["query"]: r for r in train_results}
|
||||
test_by_query = {r["query"]: r for r in test_results} if test_results else {}
|
||||
|
||||
# Compute aggregate correct/total runs across all retries
|
||||
def aggregate_runs(results: list[dict]) -> tuple[int, int]:
|
||||
correct = 0
|
||||
total = 0
|
||||
for r in results:
|
||||
runs = r.get("runs", 0)
|
||||
triggers = r.get("triggers", 0)
|
||||
total += runs
|
||||
if r.get("should_trigger", True):
|
||||
correct += triggers
|
||||
else:
|
||||
correct += runs - triggers
|
||||
return correct, total
|
||||
|
||||
train_correct, train_runs = aggregate_runs(train_results)
|
||||
test_correct, test_runs = aggregate_runs(test_results)
|
||||
|
||||
# Determine score classes
|
||||
def score_class(correct: int, total: int) -> str:
|
||||
if total > 0:
|
||||
ratio = correct / total
|
||||
if ratio >= 0.8:
|
||||
return "score-good"
|
||||
elif ratio >= 0.5:
|
||||
return "score-ok"
|
||||
return "score-bad"
|
||||
|
||||
train_class = score_class(train_correct, train_runs)
|
||||
test_class = score_class(test_correct, test_runs)
|
||||
|
||||
row_class = "best-row" if iteration == best_iter else ""
|
||||
|
||||
html_parts.append(f""" <tr class="{row_class}">
|
||||
<td>{iteration}</td>
|
||||
<td><span class="score {train_class}">{train_correct}/{train_runs}</span></td>
|
||||
<td><span class="score {test_class}">{test_correct}/{test_runs}</span></td>
|
||||
<td class="description">{html.escape(description)}</td>
|
||||
""")
|
||||
|
||||
# Add result for each train query
|
||||
for qinfo in train_queries:
|
||||
r = train_by_query.get(qinfo["query"], {})
|
||||
did_pass = r.get("pass", False)
|
||||
triggers = r.get("triggers", 0)
|
||||
runs = r.get("runs", 0)
|
||||
|
||||
icon = "✓" if did_pass else "✗"
|
||||
css_class = "pass" if did_pass else "fail"
|
||||
|
||||
html_parts.append(f' <td class="result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n')
|
||||
|
||||
# Add result for each test query (with different background)
|
||||
for qinfo in test_queries:
|
||||
r = test_by_query.get(qinfo["query"], {})
|
||||
did_pass = r.get("pass", False)
|
||||
triggers = r.get("triggers", 0)
|
||||
runs = r.get("runs", 0)
|
||||
|
||||
icon = "✓" if did_pass else "✗"
|
||||
css_class = "pass" if did_pass else "fail"
|
||||
|
||||
html_parts.append(f' <td class="result test-result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n')
|
||||
|
||||
html_parts.append(" </tr>\n")
|
||||
|
||||
html_parts.append(""" </tbody>
|
||||
</table>
|
||||
</div>
|
||||
""")
|
||||
|
||||
html_parts.append("""
|
||||
</body>
|
||||
</html>
|
||||
""")
|
||||
|
||||
return "".join(html_parts)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Generate HTML report from run_loop output")
|
||||
parser.add_argument("input", help="Path to JSON output from run_loop.py (or - for stdin)")
|
||||
parser.add_argument("-o", "--output", default=None, help="Output HTML file (default: stdout)")
|
||||
parser.add_argument("--skill-name", default="", help="Skill name to include in the report title")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.input == "-":
|
||||
data = json.load(sys.stdin)
|
||||
else:
|
||||
data = json.loads(Path(args.input).read_text())
|
||||
|
||||
html_output = generate_html(data, skill_name=args.skill_name)
|
||||
|
||||
if args.output:
|
||||
Path(args.output).write_text(html_output)
|
||||
print(f"Report written to {args.output}", file=sys.stderr)
|
||||
else:
|
||||
print(html_output)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
247
skills/public/skill-creator/scripts/improve_description.py
Executable file
@@ -0,0 +1,247 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Improve a skill description based on eval results.
|
||||
|
||||
Takes eval results (from run_eval.py) and generates an improved description
|
||||
by calling `claude -p` as a subprocess (same auth pattern as run_eval.py —
|
||||
uses the session's Claude Code auth, no separate ANTHROPIC_API_KEY needed).
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
from scripts.utils import parse_skill_md
|
||||
|
||||
|
||||
def _call_claude(prompt: str, model: str | None, timeout: int = 300) -> str:
|
||||
"""Run `claude -p` with the prompt on stdin and return the text response.
|
||||
|
||||
Prompt goes over stdin (not argv) because it embeds the full SKILL.md
|
||||
body and can easily exceed comfortable argv length.
|
||||
"""
|
||||
cmd = ["claude", "-p", "--output-format", "text"]
|
||||
if model:
|
||||
cmd.extend(["--model", model])
|
||||
|
||||
# Remove CLAUDECODE env var to allow nesting claude -p inside a
|
||||
# Claude Code session. The guard is for interactive terminal conflicts;
|
||||
# programmatic subprocess usage is safe. Same pattern as run_eval.py.
|
||||
env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
|
||||
|
||||
result = subprocess.run(
|
||||
cmd,
|
||||
input=prompt,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
env=env,
|
||||
timeout=timeout,
|
||||
)
|
||||
if result.returncode != 0:
|
||||
raise RuntimeError(
|
||||
f"claude -p exited {result.returncode}\nstderr: {result.stderr}"
|
||||
)
|
||||
return result.stdout
|
||||
|
||||
|
||||
def improve_description(
|
||||
skill_name: str,
|
||||
skill_content: str,
|
||||
current_description: str,
|
||||
eval_results: dict,
|
||||
history: list[dict],
|
||||
model: str,
|
||||
test_results: dict | None = None,
|
||||
log_dir: Path | None = None,
|
||||
iteration: int | None = None,
|
||||
) -> str:
|
||||
"""Call Claude to improve the description based on eval results."""
|
||||
failed_triggers = [
|
||||
r for r in eval_results["results"]
|
||||
if r["should_trigger"] and not r["pass"]
|
||||
]
|
||||
false_triggers = [
|
||||
r for r in eval_results["results"]
|
||||
if not r["should_trigger"] and not r["pass"]
|
||||
]
|
||||
|
||||
# Build scores summary
|
||||
train_score = f"{eval_results['summary']['passed']}/{eval_results['summary']['total']}"
|
||||
if test_results:
|
||||
test_score = f"{test_results['summary']['passed']}/{test_results['summary']['total']}"
|
||||
scores_summary = f"Train: {train_score}, Test: {test_score}"
|
||||
else:
|
||||
scores_summary = f"Train: {train_score}"
|
||||
|
||||
prompt = f"""You are optimizing a skill description for a Claude Code skill called "{skill_name}". A "skill" is sort of like a prompt, but with progressive disclosure -- there's a title and description that Claude sees when deciding whether to use the skill, and then if it does use the skill, it reads the .md file which has lots more details and potentially links to other resources in the skill folder like helper files and scripts and additional documentation or examples.
|
||||
|
||||
The description appears in Claude's "available_skills" list. When a user sends a query, Claude decides whether to invoke the skill based solely on the title and on this description. Your goal is to write a description that triggers for relevant queries, and doesn't trigger for irrelevant ones.
|
||||
|
||||
Here's the current description:
|
||||
<current_description>
|
||||
"{current_description}"
|
||||
</current_description>
|
||||
|
||||
Current scores ({scores_summary}):
|
||||
<scores_summary>
|
||||
"""
|
||||
if failed_triggers:
|
||||
prompt += "FAILED TO TRIGGER (should have triggered but didn't):\n"
|
||||
for r in failed_triggers:
|
||||
prompt += f' - "{r["query"]}" (triggered {r["triggers"]}/{r["runs"]} times)\n'
|
||||
prompt += "\n"
|
||||
|
||||
if false_triggers:
|
||||
prompt += "FALSE TRIGGERS (triggered but shouldn't have):\n"
|
||||
for r in false_triggers:
|
||||
prompt += f' - "{r["query"]}" (triggered {r["triggers"]}/{r["runs"]} times)\n'
|
||||
prompt += "\n"
|
||||
|
||||
if history:
|
||||
prompt += "PREVIOUS ATTEMPTS (do NOT repeat these — try something structurally different):\n\n"
|
||||
for h in history:
|
||||
train_s = f"{h.get('train_passed', h.get('passed', 0))}/{h.get('train_total', h.get('total', 0))}"
|
||||
test_s = f"{h.get('test_passed', '?')}/{h.get('test_total', '?')}" if h.get('test_passed') is not None else None
|
||||
score_str = f"train={train_s}" + (f", test={test_s}" if test_s else "")
|
||||
prompt += f'<attempt {score_str}>\n'
|
||||
prompt += f'Description: "{h["description"]}"\n'
|
||||
if "results" in h:
|
||||
prompt += "Train results:\n"
|
||||
for r in h["results"]:
|
||||
status = "PASS" if r["pass"] else "FAIL"
|
||||
prompt += f' [{status}] "{r["query"][:80]}" (triggered {r["triggers"]}/{r["runs"]})\n'
|
||||
if h.get("note"):
|
||||
prompt += f'Note: {h["note"]}\n'
|
||||
prompt += "</attempt>\n\n"
|
||||
|
||||
prompt += f"""</scores_summary>
|
||||
|
||||
Skill content (for context on what the skill does):
|
||||
<skill_content>
|
||||
{skill_content}
|
||||
</skill_content>
|
||||
|
||||
Based on the failures, write a new and improved description that is more likely to trigger correctly. When I say "based on the failures", it's a bit of a tricky line to walk because we don't want to overfit to the specific cases you're seeing. So what I DON'T want you to do is produce an ever-expanding list of specific queries that this skill should or shouldn't trigger for. Instead, try to generalize from the failures to broader categories of user intent and situations where this skill would be useful or not useful. The reason for this is twofold:
|
||||
|
||||
1. Avoid overfitting
|
||||
2. The list might get loooong and it's injected into ALL queries and there might be a lot of skills, so we don't want to blow too much space on any given description.
|
||||
|
||||
Concretely, your description should not be more than about 100-200 words, even if that comes at the cost of accuracy. There is a hard limit of 1024 characters — descriptions over that will be truncated, so stay comfortably under it.
|
||||
|
||||
Here are some tips that we've found to work well in writing these descriptions:
|
||||
- The skill should be phrased in the imperative -- "Use this skill for" rather than "this skill does"
|
||||
- The skill description should focus on the user's intent, what they are trying to achieve, vs. the implementation details of how the skill works.
|
||||
- The description competes with other skills for Claude's attention — make it distinctive and immediately recognizable.
|
||||
- If you're getting lots of failures after repeated attempts, change things up. Try different sentence structures or wordings.
|
||||
|
||||
I'd encourage you to be creative and mix up the style in different iterations since you'll have multiple opportunities to try different approaches and we'll just grab the highest-scoring one at the end.
|
||||
|
||||
Please respond with only the new description text in <new_description> tags, nothing else."""
|
||||
|
||||
text = _call_claude(prompt, model)
|
||||
|
||||
match = re.search(r"<new_description>(.*?)</new_description>", text, re.DOTALL)
|
||||
description = match.group(1).strip().strip('"') if match else text.strip().strip('"')
|
||||
|
||||
transcript: dict = {
|
||||
"iteration": iteration,
|
||||
"prompt": prompt,
|
||||
"response": text,
|
||||
"parsed_description": description,
|
||||
"char_count": len(description),
|
||||
"over_limit": len(description) > 1024,
|
||||
}
|
||||
|
||||
# Safety net: the prompt already states the 1024-char hard limit, but if
|
||||
# the model blew past it anyway, make one fresh single-turn call that
|
||||
# quotes the too-long version and asks for a shorter rewrite. (The old
|
||||
# SDK path did this as a true multi-turn; `claude -p` is one-shot, so we
|
||||
# inline the prior output into the new prompt instead.)
|
||||
if len(description) > 1024:
|
||||
shorten_prompt = (
|
||||
f"{prompt}\n\n"
|
||||
f"---\n\n"
|
||||
f"A previous attempt produced this description, which at "
|
||||
f"{len(description)} characters is over the 1024-character hard limit:\n\n"
|
||||
f'"{description}"\n\n'
|
||||
f"Rewrite it to be under 1024 characters while keeping the most "
|
||||
f"important trigger words and intent coverage. Respond with only "
|
||||
f"the new description in <new_description> tags."
|
||||
)
|
||||
shorten_text = _call_claude(shorten_prompt, model)
|
||||
match = re.search(r"<new_description>(.*?)</new_description>", shorten_text, re.DOTALL)
|
||||
shortened = match.group(1).strip().strip('"') if match else shorten_text.strip().strip('"')
|
||||
|
||||
transcript["rewrite_prompt"] = shorten_prompt
|
||||
transcript["rewrite_response"] = shorten_text
|
||||
transcript["rewrite_description"] = shortened
|
||||
transcript["rewrite_char_count"] = len(shortened)
|
||||
description = shortened
|
||||
|
||||
transcript["final_description"] = description
|
||||
|
||||
if log_dir:
|
||||
log_dir.mkdir(parents=True, exist_ok=True)
|
||||
log_file = log_dir / f"improve_iter_{iteration or 'unknown'}.json"
|
||||
log_file.write_text(json.dumps(transcript, indent=2))
|
||||
|
||||
return description
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Improve a skill description based on eval results")
|
||||
parser.add_argument("--eval-results", required=True, help="Path to eval results JSON (from run_eval.py)")
|
||||
parser.add_argument("--skill-path", required=True, help="Path to skill directory")
|
||||
parser.add_argument("--history", default=None, help="Path to history JSON (previous attempts)")
|
||||
parser.add_argument("--model", required=True, help="Model for improvement")
|
||||
parser.add_argument("--verbose", action="store_true", help="Print thinking to stderr")
|
||||
args = parser.parse_args()
|
||||
|
||||
skill_path = Path(args.skill_path)
|
||||
if not (skill_path / "SKILL.md").exists():
|
||||
print(f"Error: No SKILL.md found at {skill_path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
eval_results = json.loads(Path(args.eval_results).read_text())
|
||||
history = []
|
||||
if args.history:
|
||||
history = json.loads(Path(args.history).read_text())
|
||||
|
||||
name, _, content = parse_skill_md(skill_path)
|
||||
current_description = eval_results["description"]
|
||||
|
||||
if args.verbose:
|
||||
print(f"Current: {current_description}", file=sys.stderr)
|
||||
print(f"Score: {eval_results['summary']['passed']}/{eval_results['summary']['total']}", file=sys.stderr)
|
||||
|
||||
new_description = improve_description(
|
||||
skill_name=name,
|
||||
skill_content=content,
|
||||
current_description=current_description,
|
||||
eval_results=eval_results,
|
||||
history=history,
|
||||
model=args.model,
|
||||
)
|
||||
|
||||
if args.verbose:
|
||||
print(f"Improved: {new_description}", file=sys.stderr)
|
||||
|
||||
# Output as JSON with both the new description and updated history
|
||||
output = {
|
||||
"description": new_description,
|
||||
"history": history + [{
|
||||
"description": current_description,
|
||||
"passed": eval_results["summary"]["passed"],
|
||||
"failed": eval_results["summary"]["failed"],
|
||||
"total": eval_results["summary"]["total"],
|
||||
"results": eval_results["results"],
|
||||
}],
|
||||
}
|
||||
print(json.dumps(output, indent=2))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -10,10 +10,33 @@ Example:
|
||||
python utils/package_skill.py skills/public/my-skill ./dist
|
||||
"""
|
||||
|
||||
import fnmatch
|
||||
import sys
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
from quick_validate import validate_skill
|
||||
from scripts.quick_validate import validate_skill
|
||||
|
||||
# Patterns to exclude when packaging skills.
|
||||
EXCLUDE_DIRS = {"__pycache__", "node_modules"}
|
||||
EXCLUDE_GLOBS = {"*.pyc"}
|
||||
EXCLUDE_FILES = {".DS_Store"}
|
||||
# Directories excluded only at the skill root (not when nested deeper).
|
||||
ROOT_EXCLUDE_DIRS = {"evals"}
|
||||
|
||||
|
||||
def should_exclude(rel_path: Path) -> bool:
|
||||
"""Check if a path should be excluded from packaging."""
|
||||
parts = rel_path.parts
|
||||
if any(part in EXCLUDE_DIRS for part in parts):
|
||||
return True
|
||||
# rel_path is relative to skill_path.parent, so parts[0] is the skill
|
||||
# folder name and parts[1] (if present) is the first subdir.
|
||||
if len(parts) > 1 and parts[1] in ROOT_EXCLUDE_DIRS:
|
||||
return True
|
||||
name = rel_path.name
|
||||
if name in EXCLUDE_FILES:
|
||||
return True
|
||||
return any(fnmatch.fnmatch(name, pat) for pat in EXCLUDE_GLOBS)
|
||||
|
||||
|
||||
def package_skill(skill_path, output_dir=None):
|
||||
@@ -66,13 +89,16 @@ def package_skill(skill_path, output_dir=None):
|
||||
# Create the .skill file (zip format)
|
||||
try:
|
||||
with zipfile.ZipFile(skill_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
|
||||
# Walk through the skill directory
|
||||
# Walk through the skill directory, excluding build artifacts
|
||||
for file_path in skill_path.rglob('*'):
|
||||
if file_path.is_file():
|
||||
# Calculate the relative path within the zip
|
||||
arcname = file_path.relative_to(skill_path.parent)
|
||||
zipf.write(file_path, arcname)
|
||||
print(f" Added: {arcname}")
|
||||
if not file_path.is_file():
|
||||
continue
|
||||
arcname = file_path.relative_to(skill_path.parent)
|
||||
if should_exclude(arcname):
|
||||
print(f" Skipped: {arcname}")
|
||||
continue
|
||||
zipf.write(file_path, arcname)
|
||||
print(f" Added: {arcname}")
|
||||
|
||||
print(f"\n✅ Successfully packaged skill to: {skill_filename}")
|
||||
return skill_filename
|
||||
|
||||
@@ -39,7 +39,7 @@ def validate_skill(skill_path):
        return False, f"Invalid YAML in frontmatter: {e}"

    # Define allowed properties
    ALLOWED_PROPERTIES = {'name', 'description', 'license', 'allowed-tools', 'metadata'}
    ALLOWED_PROPERTIES = {'name', 'description', 'license', 'allowed-tools', 'metadata', 'compatibility'}

    # Check for unexpected properties (excluding nested keys under metadata)
    unexpected_keys = set(frontmatter.keys()) - ALLOWED_PROPERTIES
@@ -61,9 +61,9 @@ def validate_skill(skill_path):
        return False, f"Name must be a string, got {type(name).__name__}"
    name = name.strip()
    if name:
        # Check naming convention (hyphen-case: lowercase with hyphens)
        # Check naming convention (kebab-case: lowercase with hyphens)
        if not re.match(r'^[a-z0-9-]+$', name):
            return False, f"Name '{name}' should be hyphen-case (lowercase letters, digits, and hyphens only)"
            return False, f"Name '{name}' should be kebab-case (lowercase letters, digits, and hyphens only)"
        if name.startswith('-') or name.endswith('-') or '--' in name:
            return False, f"Name '{name}' cannot start/end with hyphen or contain consecutive hyphens"
        # Check name length (max 64 characters per spec)
@@ -83,6 +83,14 @@ def validate_skill(skill_path):
    if len(description) > 1024:
        return False, f"Description is too long ({len(description)} characters). Maximum is 1024 characters."

    # Validate compatibility field if present (optional)
    compatibility = frontmatter.get('compatibility', '')
    if compatibility:
        if not isinstance(compatibility, str):
            return False, f"Compatibility must be a string, got {type(compatibility).__name__}"
        if len(compatibility) > 500:
            return False, f"Compatibility is too long ({len(compatibility)} characters). Maximum is 500 characters."

    return True, "Skill is valid!"

if __name__ == "__main__":
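For reference, here is an illustrative SKILL.md frontmatter that the updated checks would accept (kebab-case name, description under 1024 characters, optional `compatibility` string under 500 characters); the skill name and field values are made up.

```python
# Hypothetical frontmatter accepted by the updated validate_skill checks.
VALID_FRONTMATTER = """\
---
name: data-report-builder
description: Build a polished report from CSV inputs. Use when the user asks for charts or summaries of tabular data.
compatibility: Requires network access and a Python environment with pandas installed.
---
"""
```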
310
skills/public/skill-creator/scripts/run_eval.py
Executable file
@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""Run trigger evaluation for a skill description.

Tests whether a skill's description causes Claude to trigger (read the skill)
for a set of queries. Outputs results as JSON.
"""

import argparse
import json
import os
import select
import subprocess
import sys
import time
import uuid
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

from scripts.utils import parse_skill_md


def find_project_root() -> Path:
    """Find the project root by walking up from cwd looking for .claude/.

    Mimics how Claude Code discovers its project root, so the command file
    we create ends up where claude -p will look for it.
    """
    current = Path.cwd()
    for parent in [current, *current.parents]:
        if (parent / ".claude").is_dir():
            return parent
    return current


def run_single_query(
    query: str,
    skill_name: str,
    skill_description: str,
    timeout: int,
    project_root: str,
    model: str | None = None,
) -> bool:
    """Run a single query and return whether the skill was triggered.

    Creates a command file in .claude/commands/ so it appears in Claude's
    available_skills list, then runs `claude -p` with the raw query.
    Uses --include-partial-messages to detect triggering early from
    stream events (content_block_start) rather than waiting for the
    full assistant message, which only arrives after tool execution.
    """
    unique_id = uuid.uuid4().hex[:8]
    clean_name = f"{skill_name}-skill-{unique_id}"
    project_commands_dir = Path(project_root) / ".claude" / "commands"
    command_file = project_commands_dir / f"{clean_name}.md"

    try:
        project_commands_dir.mkdir(parents=True, exist_ok=True)
        # Use YAML block scalar to avoid breaking on quotes in description
        indented_desc = "\n ".join(skill_description.split("\n"))
        command_content = (
            f"---\n"
            f"description: |\n"
            f" {indented_desc}\n"
            f"---\n\n"
            f"# {skill_name}\n\n"
            f"This skill handles: {skill_description}\n"
        )
        command_file.write_text(command_content)

        cmd = [
            "claude",
            "-p", query,
            "--output-format", "stream-json",
            "--verbose",
            "--include-partial-messages",
        ]
        if model:
            cmd.extend(["--model", model])

        # Remove CLAUDECODE env var to allow nesting claude -p inside a
        # Claude Code session. The guard is for interactive terminal conflicts;
        # programmatic subprocess usage is safe.
        env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}

        process = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            cwd=project_root,
            env=env,
        )

        triggered = False
        start_time = time.time()
        buffer = ""
        # Track state for stream event detection
        pending_tool_name = None
        accumulated_json = ""

        try:
            while time.time() - start_time < timeout:
                if process.poll() is not None:
                    remaining = process.stdout.read()
                    if remaining:
                        buffer += remaining.decode("utf-8", errors="replace")
                    break

                ready, _, _ = select.select([process.stdout], [], [], 1.0)
                if not ready:
                    continue

                chunk = os.read(process.stdout.fileno(), 8192)
                if not chunk:
                    break
                buffer += chunk.decode("utf-8", errors="replace")

                while "\n" in buffer:
                    line, buffer = buffer.split("\n", 1)
                    line = line.strip()
                    if not line:
                        continue

                    try:
                        event = json.loads(line)
                    except json.JSONDecodeError:
                        continue

                    # Early detection via stream events
                    if event.get("type") == "stream_event":
                        se = event.get("event", {})
                        se_type = se.get("type", "")

                        if se_type == "content_block_start":
                            cb = se.get("content_block", {})
                            if cb.get("type") == "tool_use":
                                tool_name = cb.get("name", "")
                                if tool_name in ("Skill", "Read"):
                                    pending_tool_name = tool_name
                                    accumulated_json = ""
                                else:
                                    return False

                        elif se_type == "content_block_delta" and pending_tool_name:
                            delta = se.get("delta", {})
                            if delta.get("type") == "input_json_delta":
                                accumulated_json += delta.get("partial_json", "")
                                if clean_name in accumulated_json:
                                    return True

                        elif se_type in ("content_block_stop", "message_stop"):
                            if pending_tool_name:
                                return clean_name in accumulated_json
                            if se_type == "message_stop":
                                return False

                    # Fallback: full assistant message
                    elif event.get("type") == "assistant":
                        message = event.get("message", {})
                        for content_item in message.get("content", []):
                            if content_item.get("type") != "tool_use":
                                continue
                            tool_name = content_item.get("name", "")
                            tool_input = content_item.get("input", {})
                            if tool_name == "Skill" and clean_name in tool_input.get("skill", ""):
                                triggered = True
                            elif tool_name == "Read" and clean_name in tool_input.get("file_path", ""):
                                triggered = True
                        return triggered

                    elif event.get("type") == "result":
                        return triggered
        finally:
            # Clean up process on any exit path (return, exception, timeout)
            if process.poll() is None:
                process.kill()
                process.wait()

        return triggered
    finally:
        if command_file.exists():
            command_file.unlink()


def run_eval(
    eval_set: list[dict],
    skill_name: str,
    description: str,
    num_workers: int,
    timeout: int,
    project_root: Path,
    runs_per_query: int = 1,
    trigger_threshold: float = 0.5,
    model: str | None = None,
) -> dict:
    """Run the full eval set and return results."""
    results = []

    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        future_to_info = {}
        for item in eval_set:
            for run_idx in range(runs_per_query):
                future = executor.submit(
                    run_single_query,
                    item["query"],
                    skill_name,
                    description,
                    timeout,
                    str(project_root),
                    model,
                )
                future_to_info[future] = (item, run_idx)

        query_triggers: dict[str, list[bool]] = {}
        query_items: dict[str, dict] = {}
        for future in as_completed(future_to_info):
            item, _ = future_to_info[future]
            query = item["query"]
            query_items[query] = item
            if query not in query_triggers:
                query_triggers[query] = []
            try:
                query_triggers[query].append(future.result())
            except Exception as e:
                print(f"Warning: query failed: {e}", file=sys.stderr)
                query_triggers[query].append(False)

    for query, triggers in query_triggers.items():
        item = query_items[query]
        trigger_rate = sum(triggers) / len(triggers)
        should_trigger = item["should_trigger"]
        if should_trigger:
            did_pass = trigger_rate >= trigger_threshold
        else:
            did_pass = trigger_rate < trigger_threshold
        results.append({
            "query": query,
            "should_trigger": should_trigger,
            "trigger_rate": trigger_rate,
            "triggers": sum(triggers),
            "runs": len(triggers),
            "pass": did_pass,
        })

    passed = sum(1 for r in results if r["pass"])
    total = len(results)

    return {
        "skill_name": skill_name,
        "description": description,
        "results": results,
        "summary": {
            "total": total,
            "passed": passed,
            "failed": total - passed,
        },
    }


def main():
    parser = argparse.ArgumentParser(description="Run trigger evaluation for a skill description")
    parser.add_argument("--eval-set", required=True, help="Path to eval set JSON file")
    parser.add_argument("--skill-path", required=True, help="Path to skill directory")
    parser.add_argument("--description", default=None, help="Override description to test")
    parser.add_argument("--num-workers", type=int, default=10, help="Number of parallel workers")
    parser.add_argument("--timeout", type=int, default=30, help="Timeout per query in seconds")
    parser.add_argument("--runs-per-query", type=int, default=3, help="Number of runs per query")
    parser.add_argument("--trigger-threshold", type=float, default=0.5, help="Trigger rate threshold")
    parser.add_argument("--model", default=None, help="Model to use for claude -p (default: user's configured model)")
    parser.add_argument("--verbose", action="store_true", help="Print progress to stderr")
    args = parser.parse_args()

    eval_set = json.loads(Path(args.eval_set).read_text())
    skill_path = Path(args.skill_path)

    if not (skill_path / "SKILL.md").exists():
        print(f"Error: No SKILL.md found at {skill_path}", file=sys.stderr)
        sys.exit(1)

    name, original_description, content = parse_skill_md(skill_path)
    description = args.description or original_description
    project_root = find_project_root()

    if args.verbose:
        print(f"Evaluating: {description}", file=sys.stderr)

    output = run_eval(
        eval_set=eval_set,
        skill_name=name,
        description=description,
        num_workers=args.num_workers,
        timeout=args.timeout,
        project_root=project_root,
        runs_per_query=args.runs_per_query,
        trigger_threshold=args.trigger_threshold,
        model=args.model,
    )

    if args.verbose:
        summary = output["summary"]
        print(f"Results: {summary['passed']}/{summary['total']} passed", file=sys.stderr)
        for r in output["results"]:
            status = "PASS" if r["pass"] else "FAIL"
            rate_str = f"{r['triggers']}/{r['runs']}"
            print(f" [{status}] rate={rate_str} expected={r['should_trigger']}: {r['query'][:70]}", file=sys.stderr)

    print(json.dumps(output, indent=2))


if __name__ == "__main__":
    main()
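The script loads the eval set with `json.loads`, and each entry only needs the `query` and `should_trigger` keys it reads, so a minimal eval set could be produced as in the sketch below; the queries and the file path are hypothetical.

```python
import json
from pathlib import Path

eval_set = [
    {"query": "Package my skill folder into a .skill file", "should_trigger": True},
    {"query": "What's the weather in Paris today?", "should_trigger": False},
]
Path("evals/eval_set.json").write_text(json.dumps(eval_set, indent=2))
```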
328
skills/public/skill-creator/scripts/run_loop.py
Executable file
@@ -0,0 +1,328 @@
#!/usr/bin/env python3
"""Run the eval + improve loop until all pass or max iterations reached.

Combines run_eval.py and improve_description.py in a loop, tracking history
and returning the best description found. Supports train/test split to prevent
overfitting.
"""

import argparse
import json
import random
import sys
import tempfile
import time
import webbrowser
from pathlib import Path

from scripts.generate_report import generate_html
from scripts.improve_description import improve_description
from scripts.run_eval import find_project_root, run_eval
from scripts.utils import parse_skill_md


def split_eval_set(eval_set: list[dict], holdout: float, seed: int = 42) -> tuple[list[dict], list[dict]]:
    """Split eval set into train and test sets, stratified by should_trigger."""
    random.seed(seed)

    # Separate by should_trigger
    trigger = [e for e in eval_set if e["should_trigger"]]
    no_trigger = [e for e in eval_set if not e["should_trigger"]]

    # Shuffle each group
    random.shuffle(trigger)
    random.shuffle(no_trigger)

    # Calculate split points
    n_trigger_test = max(1, int(len(trigger) * holdout))
    n_no_trigger_test = max(1, int(len(no_trigger) * holdout))

    # Split
    test_set = trigger[:n_trigger_test] + no_trigger[:n_no_trigger_test]
    train_set = trigger[n_trigger_test:] + no_trigger[n_no_trigger_test:]

    return train_set, test_set


def run_loop(
    eval_set: list[dict],
    skill_path: Path,
    description_override: str | None,
    num_workers: int,
    timeout: int,
    max_iterations: int,
    runs_per_query: int,
    trigger_threshold: float,
    holdout: float,
    model: str,
    verbose: bool,
    live_report_path: Path | None = None,
    log_dir: Path | None = None,
) -> dict:
    """Run the eval + improvement loop."""
    project_root = find_project_root()
    name, original_description, content = parse_skill_md(skill_path)
    current_description = description_override or original_description

    # Split into train/test if holdout > 0
    if holdout > 0:
        train_set, test_set = split_eval_set(eval_set, holdout)
        if verbose:
            print(f"Split: {len(train_set)} train, {len(test_set)} test (holdout={holdout})", file=sys.stderr)
    else:
        train_set = eval_set
        test_set = []

    history = []
    exit_reason = "unknown"

    for iteration in range(1, max_iterations + 1):
        if verbose:
            print(f"\n{'='*60}", file=sys.stderr)
            print(f"Iteration {iteration}/{max_iterations}", file=sys.stderr)
            print(f"Description: {current_description}", file=sys.stderr)
            print(f"{'='*60}", file=sys.stderr)

        # Evaluate train + test together in one batch for parallelism
        all_queries = train_set + test_set
        t0 = time.time()
        all_results = run_eval(
            eval_set=all_queries,
            skill_name=name,
            description=current_description,
            num_workers=num_workers,
            timeout=timeout,
            project_root=project_root,
            runs_per_query=runs_per_query,
            trigger_threshold=trigger_threshold,
            model=model,
        )
        eval_elapsed = time.time() - t0

        # Split results back into train/test by matching queries
        train_queries_set = {q["query"] for q in train_set}
        train_result_list = [r for r in all_results["results"] if r["query"] in train_queries_set]
        test_result_list = [r for r in all_results["results"] if r["query"] not in train_queries_set]

        train_passed = sum(1 for r in train_result_list if r["pass"])
        train_total = len(train_result_list)
        train_summary = {"passed": train_passed, "failed": train_total - train_passed, "total": train_total}
        train_results = {"results": train_result_list, "summary": train_summary}

        if test_set:
            test_passed = sum(1 for r in test_result_list if r["pass"])
            test_total = len(test_result_list)
            test_summary = {"passed": test_passed, "failed": test_total - test_passed, "total": test_total}
            test_results = {"results": test_result_list, "summary": test_summary}
        else:
            test_results = None
            test_summary = None

        history.append({
            "iteration": iteration,
            "description": current_description,
            "train_passed": train_summary["passed"],
            "train_failed": train_summary["failed"],
            "train_total": train_summary["total"],
            "train_results": train_results["results"],
            "test_passed": test_summary["passed"] if test_summary else None,
            "test_failed": test_summary["failed"] if test_summary else None,
            "test_total": test_summary["total"] if test_summary else None,
            "test_results": test_results["results"] if test_results else None,
            # For backward compat with report generator
            "passed": train_summary["passed"],
            "failed": train_summary["failed"],
            "total": train_summary["total"],
            "results": train_results["results"],
        })

        # Write live report if path provided
        if live_report_path:
            partial_output = {
                "original_description": original_description,
                "best_description": current_description,
                "best_score": "in progress",
                "iterations_run": len(history),
                "holdout": holdout,
                "train_size": len(train_set),
                "test_size": len(test_set),
                "history": history,
            }
            live_report_path.write_text(generate_html(partial_output, auto_refresh=True, skill_name=name))

        if verbose:
            def print_eval_stats(label, results, elapsed):
                pos = [r for r in results if r["should_trigger"]]
                neg = [r for r in results if not r["should_trigger"]]
                tp = sum(r["triggers"] for r in pos)
                pos_runs = sum(r["runs"] for r in pos)
                fn = pos_runs - tp
                fp = sum(r["triggers"] for r in neg)
                neg_runs = sum(r["runs"] for r in neg)
                tn = neg_runs - fp
                total = tp + tn + fp + fn
                precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
                recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
                accuracy = (tp + tn) / total if total > 0 else 0.0
                print(f"{label}: {tp+tn}/{total} correct, precision={precision:.0%} recall={recall:.0%} accuracy={accuracy:.0%} ({elapsed:.1f}s)", file=sys.stderr)
                for r in results:
                    status = "PASS" if r["pass"] else "FAIL"
                    rate_str = f"{r['triggers']}/{r['runs']}"
                    print(f" [{status}] rate={rate_str} expected={r['should_trigger']}: {r['query'][:60]}", file=sys.stderr)

            print_eval_stats("Train", train_results["results"], eval_elapsed)
            if test_summary:
                print_eval_stats("Test ", test_results["results"], 0)

        if train_summary["failed"] == 0:
            exit_reason = f"all_passed (iteration {iteration})"
            if verbose:
                print(f"\nAll train queries passed on iteration {iteration}!", file=sys.stderr)
            break

        if iteration == max_iterations:
            exit_reason = f"max_iterations ({max_iterations})"
            if verbose:
                print(f"\nMax iterations reached ({max_iterations}).", file=sys.stderr)
            break

        # Improve the description based on train results
        if verbose:
            print(f"\nImproving description...", file=sys.stderr)

        t0 = time.time()
        # Strip test scores from history so improvement model can't see them
        blinded_history = [
            {k: v for k, v in h.items() if not k.startswith("test_")}
            for h in history
        ]
        new_description = improve_description(
            skill_name=name,
            skill_content=content,
            current_description=current_description,
            eval_results=train_results,
            history=blinded_history,
            model=model,
            log_dir=log_dir,
            iteration=iteration,
        )
        improve_elapsed = time.time() - t0

        if verbose:
            print(f"Proposed ({improve_elapsed:.1f}s): {new_description}", file=sys.stderr)

        current_description = new_description

    # Find the best iteration by TEST score (or train if no test set)
    if test_set:
        best = max(history, key=lambda h: h["test_passed"] or 0)
        best_score = f"{best['test_passed']}/{best['test_total']}"
    else:
        best = max(history, key=lambda h: h["train_passed"])
        best_score = f"{best['train_passed']}/{best['train_total']}"

    if verbose:
        print(f"\nExit reason: {exit_reason}", file=sys.stderr)
        print(f"Best score: {best_score} (iteration {best['iteration']})", file=sys.stderr)

    return {
        "exit_reason": exit_reason,
        "original_description": original_description,
        "best_description": best["description"],
        "best_score": best_score,
        "best_train_score": f"{best['train_passed']}/{best['train_total']}",
        "best_test_score": f"{best['test_passed']}/{best['test_total']}" if test_set else None,
        "final_description": current_description,
        "iterations_run": len(history),
        "holdout": holdout,
        "train_size": len(train_set),
        "test_size": len(test_set),
        "history": history,
    }


def main():
    parser = argparse.ArgumentParser(description="Run eval + improve loop")
    parser.add_argument("--eval-set", required=True, help="Path to eval set JSON file")
    parser.add_argument("--skill-path", required=True, help="Path to skill directory")
    parser.add_argument("--description", default=None, help="Override starting description")
    parser.add_argument("--num-workers", type=int, default=10, help="Number of parallel workers")
    parser.add_argument("--timeout", type=int, default=30, help="Timeout per query in seconds")
    parser.add_argument("--max-iterations", type=int, default=5, help="Max improvement iterations")
    parser.add_argument("--runs-per-query", type=int, default=3, help="Number of runs per query")
    parser.add_argument("--trigger-threshold", type=float, default=0.5, help="Trigger rate threshold")
    parser.add_argument("--holdout", type=float, default=0.4, help="Fraction of eval set to hold out for testing (0 to disable)")
    parser.add_argument("--model", required=True, help="Model for improvement")
    parser.add_argument("--verbose", action="store_true", help="Print progress to stderr")
    parser.add_argument("--report", default="auto", help="Generate HTML report at this path (default: 'auto' for temp file, 'none' to disable)")
    parser.add_argument("--results-dir", default=None, help="Save all outputs (results.json, report.html, log.txt) to a timestamped subdirectory here")
    args = parser.parse_args()

    eval_set = json.loads(Path(args.eval_set).read_text())
    skill_path = Path(args.skill_path)

    if not (skill_path / "SKILL.md").exists():
        print(f"Error: No SKILL.md found at {skill_path}", file=sys.stderr)
        sys.exit(1)

    name, _, _ = parse_skill_md(skill_path)

    # Set up live report path
    if args.report != "none":
        if args.report == "auto":
            timestamp = time.strftime("%Y%m%d_%H%M%S")
            live_report_path = Path(tempfile.gettempdir()) / f"skill_description_report_{skill_path.name}_{timestamp}.html"
        else:
            live_report_path = Path(args.report)
        # Open the report immediately so the user can watch
        live_report_path.write_text("<html><body><h1>Starting optimization loop...</h1><meta http-equiv='refresh' content='5'></body></html>")
        webbrowser.open(str(live_report_path))
    else:
        live_report_path = None

    # Determine output directory (create before run_loop so logs can be written)
    if args.results_dir:
        timestamp = time.strftime("%Y-%m-%d_%H%M%S")
        results_dir = Path(args.results_dir) / timestamp
        results_dir.mkdir(parents=True, exist_ok=True)
    else:
        results_dir = None

    log_dir = results_dir / "logs" if results_dir else None

    output = run_loop(
        eval_set=eval_set,
        skill_path=skill_path,
        description_override=args.description,
        num_workers=args.num_workers,
        timeout=args.timeout,
        max_iterations=args.max_iterations,
        runs_per_query=args.runs_per_query,
        trigger_threshold=args.trigger_threshold,
        holdout=args.holdout,
        model=args.model,
        verbose=args.verbose,
        live_report_path=live_report_path,
        log_dir=log_dir,
    )

    # Save JSON output
    json_output = json.dumps(output, indent=2)
    print(json_output)
    if results_dir:
        (results_dir / "results.json").write_text(json_output)

    # Write final HTML report (without auto-refresh)
    if live_report_path:
        live_report_path.write_text(generate_html(output, auto_refresh=False, skill_name=name))
        print(f"\nReport: {live_report_path}", file=sys.stderr)

    if results_dir and live_report_path:
        (results_dir / "report.html").write_text(generate_html(output, auto_refresh=False, skill_name=name))

    if results_dir:
        print(f"Results saved to: {results_dir}", file=sys.stderr)


if __name__ == "__main__":
    main()
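To make the holdout arithmetic concrete, here is a small sketch of how `split_eval_set` partitions a 10-query set at the default `--holdout 0.4`; the queries are placeholders.

```python
eval_set = (
    [{"query": f"positive query {i}", "should_trigger": True} for i in range(6)]
    + [{"query": f"negative query {i}", "should_trigger": False} for i in range(4)]
)
train, test = split_eval_set(eval_set, holdout=0.4)
# max(1, int(6 * 0.4)) = 2 positives and max(1, int(4 * 0.4)) = 1 negative are held out,
# so the test set has 3 queries and the train set keeps the remaining 7.
assert len(test) == 3 and len(train) == 7
```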
47
skills/public/skill-creator/scripts/utils.py
Normal file
@@ -0,0 +1,47 @@
"""Shared utilities for skill-creator scripts."""

from pathlib import Path


def parse_skill_md(skill_path: Path) -> tuple[str, str, str]:
    """Parse a SKILL.md file, returning (name, description, full_content)."""
    content = (skill_path / "SKILL.md").read_text()
    lines = content.split("\n")

    if lines[0].strip() != "---":
        raise ValueError("SKILL.md missing frontmatter (no opening ---)")

    end_idx = None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            end_idx = i
            break

    if end_idx is None:
        raise ValueError("SKILL.md missing frontmatter (no closing ---)")

    name = ""
    description = ""
    frontmatter_lines = lines[1:end_idx]
    i = 0
    while i < len(frontmatter_lines):
        line = frontmatter_lines[i]
        if line.startswith("name:"):
            name = line[len("name:"):].strip().strip('"').strip("'")
        elif line.startswith("description:"):
            value = line[len("description:"):].strip()
            # Handle YAML multiline indicators (>, |, >-, |-)
            if value in (">", "|", ">-", "|-"):
                continuation_lines: list[str] = []
                i += 1
                while i < len(frontmatter_lines) and (frontmatter_lines[i].startswith(" ") or frontmatter_lines[i].startswith("\t")):
                    continuation_lines.append(frontmatter_lines[i].strip())
                    i += 1
                description = " ".join(continuation_lines)
                continue
            else:
                description = value.strip('"').strip("'")
        i += 1

    return name, description, content
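A short usage sketch for `parse_skill_md`; the directory below is hypothetical and only needs to contain a SKILL.md with `name:` and `description:` in its frontmatter.

```python
from pathlib import Path

name, description, content = parse_skill_md(Path("skills/public/my-skill"))
print(name)         # frontmatter name, quotes stripped
print(description)  # frontmatter description, folded to one line if it used a block scalar
```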