feat: support infoquest (#708)

* support infoquest

* support html checker

* change line break format

* Fix several critical issues in the codebase
- Resolve crawler panic by improving error handling
- Fix plan validation to prevent invalid configurations
- Correct InfoQuest crawler JSON conversion logic

* add test for infoquest

* Add InfoQuest introduction to the README

* fix readme for infoquest

* resolve the conflict

* Fix formatting of INFOQUEST in SearchEngine enum

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Willem Jiang <143703838+willem-bd@users.noreply.github.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
infoquest-byteplus
2025-12-02 08:16:35 +08:00
committed by GitHub
parent e179fb1632
commit 7ec9e45702
22 changed files with 2103 additions and 94 deletions

View File

@@ -24,9 +24,10 @@ ENABLE_MCP_SERVER_CONFIGURATION=false
# Otherwise, your system could be compromised.
ENABLE_PYTHON_REPL=false
# Search Engine, Supported values: tavily (recommended), duckduckgo, brave_search, arxiv, searx
# Search Engine, Supported values: tavily, infoquest (recommended), duckduckgo, brave_search, arxiv, searx
SEARCH_API=tavily
TAVILY_API_KEY=tvly-xxx
INFOQUEST_API_KEY="infoquest-xxx"
# SEARX_HOST=xxx # Required only if SEARCH_API is searx.(compatible with both Searx and SearxNG)
# BRAVE_SEARCH_API_KEY=xxx # Required only if SEARCH_API is brave_search
# JINA_API_KEY=jina_xxx # Optional, default is None

View File

@@ -14,6 +14,7 @@
Currently, DeerFlow has officially entered the [FaaS Application Center of Volcengine](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market). Users can experience it online through the [experience link](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market/deerflow/?channel=github&source=deerflow) to intuitively feel its powerful functions and convenient operations. At the same time, to meet the deployment needs of different users, DeerFlow supports one-click deployment based on Volcengine. Click the [deployment link](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/application/create?templateId=683adf9e372daa0008aaed5c&channel=github&source=deerflow) to quickly complete the deployment process and start an efficient research journey.
DeerFlow has newly integrated the intelligent search and crawling toolset independently developed by BytePlus: [InfoQuest (supports a free online experience)](https://console.byteplus.com/infoquest/infoquests).
Please visit [our official website](https://deerflow.tech/) for more details.
@@ -159,6 +160,13 @@ DeerFlow supports multiple search engines that can be configured in your `.env`
- Requires `TAVILY_API_KEY` in your `.env` file
- Sign up at: https://app.tavily.com/home
- **InfoQuest** (recommended): AI-optimized intelligent search and crawling toolset independently developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Supports time range filtering and site filtering
- Provides high-quality search results and content extraction
- Sign up at: https://console.byteplus.com/infoquest/infoquests
- Visit https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest to learn more
- **DuckDuckGo**: Privacy-focused search engine
- No API key required
@@ -177,10 +185,31 @@ DeerFlow supports multiple search engines that can be configured in your `.env`
To configure your preferred search engine, set the `SEARCH_API` variable in your `.env` file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
# Choose one: tavily, infoquest, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```
### Crawling Tools
DeerFlow supports multiple crawling tools that can be configured in your `conf.yaml` file:
- **Jina** (default): Freely accessible web content crawling tool
- **InfoQuest** (recommended): AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Provides configurable crawling parameters
- Supports custom timeout settings
- Offers more powerful content extraction capabilities
- Visit https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest to learn more
To configure your preferred crawling tool, set the following in your `conf.yaml` file:
```yaml
CRAWLER_ENGINE:
  # Engine type: "jina" (default) or "infoquest"
  engine: infoquest
```
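The engine selection described above can be sketched as a small standalone function (hypothetical, not the project's actual module): `"jina"` is assumed as the default when the `engine` key is absent, and unknown values are rejected.

```python
# Standalone sketch of how the CRAWLER_ENGINE block maps to an engine choice.
SUPPORTED_ENGINES = ("jina", "infoquest")

def select_engine(crawler_config: dict) -> str:
    # Fall back to the default engine when the key is missing
    engine = crawler_config.get("engine", "jina")
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"Unsupported crawler engine: {engine}")
    return engine

print(select_engine({"engine": "infoquest"}))  # infoquest
print(select_engine({}))  # jina
```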
### Private Knowledgebase
DeerFlow supports private knowledgebases such as RAGFlow, Qdrant, Milvus, and VikingDB, so that you can use your private documents to answer questions.
@@ -221,8 +250,8 @@ DeerFlow supports private knowledgebase such as RAGFlow, Qdrant, Milvus, and Vik
### Tools and MCP Integrations
- 🔍 **Search and Retrieval**
- Web search via Tavily, Brave Search and more
- Crawling with Jina
- Web search via Tavily, InfoQuest, Brave Search and more
- Crawling with Jina and InfoQuest
- Advanced content extraction
- Support for private knowledgebase
@@ -284,7 +313,6 @@ The system employs a streamlined workflow with the following components:
- Manages the research flow and decides when to generate the final report
3. **Research Team**: A collection of specialized agents that execute the plan:
- **Researcher**: Conducts web searches and information gathering using tools like web search engines, crawling and even MCP services.
- **Coder**: Handles code analysis, execution, and technical tasks using Python REPL tool.
Each agent has access to specific tools optimized for their role and operates within the LangGraph framework
@@ -475,7 +503,6 @@ docker build -t deer-flow-api .
```
Finally, start a Docker container running the web server:
```bash
# Replace deer-flow-api-app with your preferred container name
# Start the server then bind to localhost:8000

View File

@@ -13,6 +13,8 @@
Currently, DeerFlow has officially entered the [FaaS Application Center of Volcengine](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market). Users can experience it online via the [experience link](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market/deerflow/?channel=github&source=deerflow) to get an intuitive feel for its powerful features and convenient operation. At the same time, to meet the deployment needs of different users, DeerFlow supports one-click deployment based on Volcengine. Click the [deployment link](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/application/create?templateId=683adf9e372daa0008aaed5c&channel=github&source=deerflow) to quickly complete the deployment process and start an efficient research journey.
DeerFlow has newly integrated the intelligent search and crawling toolset from BytePlus: [InfoQuest (supports a free online experience)](https://console.byteplus.com/infoquest/infoquests)
Please visit [our official website](https://deerflow.tech/) for more details.
## Demo
@@ -156,6 +158,13 @@ DeerFlow supports multiple search engines, which can be configured in your `.env` file via the
- Requires `TAVILY_API_KEY` in your `.env` file
- Sign up at: https://app.tavily.com/home
- **InfoQuest** (recommended): An AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Supports time range filtering and site filtering
- Provides high-quality search results and content extraction
- Sign up at: https://console.byteplus.com/infoquest/infoquests
- Visit https://docs.byteplus.com/de/docs/InfoQuest/What_is_Info_Quest for more information
- **DuckDuckGo**: Privacy-focused search engine
- No API key required
@@ -174,10 +183,32 @@ DeerFlow supports multiple search engines, which can be configured in your `.env` file via the
To configure your preferred search engine, set the `SEARCH_API` variable in your `.env` file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
# Choose one: tavily, infoquest, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```
### Crawling Tools
- **Jina** (default): Freely accessible web content crawling tool
- No API key required for basic features
- Using an API key grants higher access rates
- More information at <https://jina.ai/reader>
- **InfoQuest** (recommended): AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Provides configurable crawling parameters
- Supports custom timeout settings
- Offers more powerful content extraction capabilities
- More information at <https://docs.byteplus.com/de/docs/InfoQuest/What_is_Info_Quest>
To configure your preferred crawling tool, set the following in your `conf.yaml` file:
```yaml
CRAWLER_ENGINE:
  # Engine type: "jina" (default) or "infoquest"
  engine: infoquest
```
### Private Knowledgebase
DeerFlow supports private knowledgebases such as RAGFlow and VikingDB, so that you can use your private documents to answer questions.
@@ -205,8 +236,8 @@ DeerFlow supports private knowledgebases such as RAGFlow and VikingDB, so that you
### Tools and MCP Integrations
- 🔍 **Search and Retrieval**
- Web search via Tavily, Brave Search and more
- Crawling with Jina
- Web search via Tavily, InfoQuest, Brave Search and more
- Crawling with Jina and InfoQuest
- Advanced content extraction
- Support for private knowledgebase
@@ -505,7 +536,6 @@ The application now supports an interactive mode with built-in questions
4. The system will process your question and generate a comprehensive research report
### Human-in-the-Loop
DeerFlow includes a human-in-the-loop mechanism that allows you to review, edit, and approve research plans before they are executed:
1. **Plan Review**: When human-in-the-loop is enabled, the system presents the generated research plan for review before execution

View File

@@ -13,6 +13,8 @@
Currently, DeerFlow has officially entered Volcengine's FaaS Application Center. Users can experience it online through the experience link to intuitively feel its powerful functions and convenient operations. At the same time, to meet the deployment needs of different users, DeerFlow supports one-click deployment based on Volcengine. Click the deployment link to quickly complete the deployment process and start an efficient research journey.
DeerFlow has recently integrated the intelligent search and crawling toolset independently developed by BytePlus: [InfoQuest (supports a free online experience)](https://console.byteplus.com/infoquest/infoquests)
Please visit [our official website](https://deerflow.tech/) for more details.
## Demo
@@ -155,6 +157,13 @@ DeerFlow supports multiple search engines that can be configured in your
- Requires `TAVILY_API_KEY` in your `.env` file
- Sign up at: <https://app.tavily.com/home>
- **InfoQuest** (recommended): An AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Supports date range filtering and site filtering
- Provides high-quality search results and content extraction
- Sign up at: <https://console.byteplus.com/infoquest/infoquests>
- Visit https://docs.byteplus.com/es/docs/InfoQuest/What_is_Info_Quest for more information
- **DuckDuckGo**: Privacy-focused search engine
- No API key required
@@ -175,10 +184,32 @@ DeerFlow supports multiple search engines that can be configured in your
To configure your preferred search engine, set the `SEARCH_API` variable in your `.env` file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
# Choose one: tavily, infoquest, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```
### Crawling Tools
- **Jina** (default): Free, accessible web content crawling tool
- No API key required for basic features
- Using an API key grants higher access rate limits
- Visit <https://jina.ai/reader> for more information
- **InfoQuest** (recommended): AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Provides configurable crawling parameters
- Supports custom timeout settings
- Offers more powerful content extraction capabilities
- Visit <https://docs.byteplus.com/es/docs/InfoQuest/What_is_Info_Quest> for more information
To configure your preferred crawling tool, set the following in your `conf.yaml` file:
```yaml
CRAWLER_ENGINE:
  # Engine type: "jina" (default) or "infoquest"
  engine: infoquest
```
## Features
### Core Capabilities
@@ -193,8 +224,8 @@ SEARCH_API=tavily
- 🔍 **Search and Retrieval**
- Web search via Tavily, Brave Search and more
- Crawling with Jina
- Web search via Tavily, InfoQuest, Brave Search and more
- Crawling with Jina and InfoQuest
- Advanced content extraction
- 🔗 **Seamless MCP Integration**

View File

@@ -11,6 +11,8 @@
Currently, DeerFlow has officially entered Volcengine's FaaS Application Center. Users can experience it online through the experience link and intuitively feel its powerful features and convenient operation. At the same time, to meet the deployment needs of various users, DeerFlow supports one-click deployment based on Volcengine. Click the deployment link to quickly complete the deployment process and start an efficient research journey.
DeerFlow has newly integrated the intelligent search and crawling toolset independently developed by BytePlus: [InfoQuest (supports a free online experience)](https://console.byteplus.com/infoquest/infoquests)
Please visit [DeerFlow's official website](https://deerflow.tech/) for more details.
## Demo
@@ -151,6 +153,13 @@ DeerFlow supports multiple search engines, which can be configured via the `.env` file
- Requires `TAVILY_API_KEY` in the `.env` file
- Sign up at: <https://app.tavily.com/home>
- **InfoQuest** (recommended): An AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in the `.env` file
- Supports time range filtering and site filtering
- Provides high-quality search results and content extraction
- Sign up at: <https://console.byteplus.com/infoquest/infoquests>
- Documentation: <https://docs.byteplus.com/ja/docs/InfoQuest/What_is_Info_Quest>
- **DuckDuckGo**: Privacy-focused search engine
- No API key required
@@ -169,10 +178,32 @@ DeerFlow supports multiple search engines, which can be configured via the `.env` file
To configure your preferred search engine, set the `SEARCH_API` variable in the `.env` file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
# Choose one: tavily, infoquest, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```
### Crawling Tools
- **Jina** (default): Freely accessible web content crawling tool
- No API key required for basic features
- Using an API key grants higher access rate limits
- See <https://jina.ai/reader> for more information
- **InfoQuest** (recommended): An AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in the `.env` file
- Provides configurable crawling parameters
- Supports custom timeout settings
- Offers more powerful content extraction capabilities
- See <https://docs.byteplus.com/ja/docs/InfoQuest/What_is_Info_Quest> for more information
To configure your preferred crawling tool, set the following in the `conf.yaml` file:
```yaml
CRAWLER_ENGINE:
  # Engine type: "jina" (default) or "infoquest"
  engine: infoquest
```
## Features
### Core Capabilities
@@ -186,8 +217,8 @@ SEARCH_API=tavily
### Tools and MCP Integration
- 🔍 **Search and Retrieval**
- Web search via Tavily, Brave Search and more
- Crawling with Jina
- Web search via Tavily, InfoQuest, Brave Search and more
- Crawling with Jina and InfoQuest
- Advanced content extraction
- 🔗 **Seamless MCP Integration**

View File

@@ -14,6 +14,8 @@
Currently, DeerFlow has officially entered Volcengine's FaaS Application Center. Users can experience it online through the experience link to intuitively feel its powerful functions and convenient operations. At the same time, to meet the deployment needs of different users, DeerFlow supports one-click deployment based on Volcengine. Click the deployment link to quickly complete the deployment process and start an efficient research journey.
DeerFlow has recently integrated the intelligent search and crawling toolset independently developed by BytePlus: [InfoQuest (offers a free online experience)](https://console.byteplus.com/infoquest/infoquests)
Please visit [our official website](https://deerflow.tech/) for more details.
## Demo
@@ -158,6 +160,13 @@ DeerFlow supports multiple search engines that can be configured in your
- Requires `TAVILY_API_KEY` in your `.env` file
- Sign up at: <https://app.tavily.com/home>
- **InfoQuest** (recommended): An AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Supports time range filtering and site filtering
- Provides high-quality search results and content extraction
- Sign up at: <https://console.byteplus.com/infoquest/infoquests>
- Visit https://docs.byteplus.com/pt/docs/InfoQuest/What_is_Info_Quest for more information
- **DuckDuckGo**: Privacy-focused search engine
- No API key required
@@ -178,10 +187,32 @@ DeerFlow supports multiple search engines that can be configured in your
To configure your preferred engine, set the `SEARCH_API` variable in your file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
# Choose one: tavily, infoquest, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```
### Crawling Tools
- **Jina** (default): Free, accessible web content crawling tool
- No API key required for basic features
- Using an API key grants higher access rate limits
- Visit <https://jina.ai/reader> for more information
- **InfoQuest** (recommended): AI-optimized intelligent search and crawling toolset developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Provides configurable crawling parameters
- Supports custom timeout settings
- Offers more powerful content extraction capabilities
- Visit <https://docs.byteplus.com/pt/docs/InfoQuest/What_is_Info_Quest> for more information
To configure your preferred crawling tool, set the following in your `conf.yaml` file:
```yaml
CRAWLER_ENGINE:
  # Engine type: "jina" (default) or "infoquest"
  engine: infoquest
```
## Features
### Main Features
@@ -197,8 +228,8 @@ SEARCH_API=tavily
- 🔍 **Search and Retrieval**
- Web search with Tavily, Brave Search and more
- Crawling with Jina
- Web search with Tavily, InfoQuest, Brave Search and more
- Crawling with Jina and InfoQuest
- Advanced content extraction
- 🔗 **Seamless MCP Integration**

View File

@@ -13,6 +13,8 @@
Currently, DeerFlow has officially entered Volcengine's FaaS Application Center. Users can experience it online through the experience link to intuitively feel its powerful functions and convenient operations. At the same time, to meet the deployment needs of different users, DeerFlow supports one-click deployment based on Volcengine. Click the deployment link to quickly complete the deployment process and start an efficient research journey.
DeerFlow has recently integrated the intelligent search and crawling toolset independently developed by BytePlus: [InfoQuest (supports a free online trial)](https://console.byteplus.com/infoquest/infoquests)
Please visit [our official website](https://deerflow.tech/) for more information.
## Demo
@@ -155,6 +157,13 @@ DeerFlow supports multiple search engines
- Requires `TAVILY_API_KEY` in your `.env` file
- Sign up at: <https://app.tavily.com/home>
- **InfoQuest** (recommended): A set of AI-optimized intelligent search and crawling tools developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Supports time range filtering and site filtering
- Provides high-quality search results and content extraction
- Sign up at: <https://console.byteplus.com/infoquest/infoquests>
- Visit https://docs.byteplus.com/ru/docs/InfoQuest/What_is_Info_Quest for more information
- **DuckDuckGo**: Privacy-focused search engine
- No API key required
@@ -175,10 +184,32 @@ DeerFlow supports multiple search engines
To configure your preferred search engine, set the `SEARCH_API` variable in your `.env` file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
# Choose one: tavily, infoquest, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```
### Crawling Tools
- **Jina** (default): Free, accessible web content crawling tool
- No API key required for basic features
- Using an API key grants higher access rate limits
- Visit <https://jina.ai/reader> for more information
- **InfoQuest** (recommended): A set of AI-optimized intelligent search and crawling tools developed by BytePlus
- Requires `INFOQUEST_API_KEY` in your `.env` file
- Provides configurable crawling parameters
- Supports custom timeout settings
- Provides more powerful content extraction capabilities
- Visit <https://docs.byteplus.com/ru/docs/InfoQuest/What_is_Info_Quest> for more information
To configure your preferred crawling tool, set the following in your `conf.yaml` file:
```yaml
CRAWLER_ENGINE:
  # Engine type: "jina" (default) or "infoquest"
  engine: infoquest
```
## Features
### Key Capabilities
@@ -193,8 +224,8 @@ SEARCH_API=tavily
- 🔍 **Search and Retrieval**
- Web search via Tavily, Brave Search and others
- Crawling with Jina
- Web search via Tavily, InfoQuest, Brave Search and others
- Crawling with Jina and InfoQuest
- Advanced content extraction
- 🔗 **Seamless MCP Integration**

View File

@@ -11,6 +11,8 @@
Currently, DeerFlow has officially entered the [FaaS Application Center of Volcengine](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market). Users can experience it online through the [experience link](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market/deerflow/?channel=github&source=deerflow) to intuitively feel its powerful features and convenient operation. At the same time, to meet the deployment needs of different users, DeerFlow supports one-click deployment based on Volcengine; click the [deployment link](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/application/create?templateId=683adf9e372daa0008aaed5c&channel=github&source=deerflow) to quickly complete the deployment process and start an efficient research journey.
DeerFlow has newly integrated the intelligent search and crawling toolset independently launched by BytePlus: [InfoQuest (free online experience available)](https://console.byteplus.com/infoquest/infoquests)
Please visit [DeerFlow's official website](https://deerflow.tech/) for more details.
## Demo
@@ -152,6 +154,13 @@ DeerFlow supports multiple search engines, configurable via `SEARCH_API` in the `.env` file
- Requires `TAVILY_API_KEY` to be set in the `.env` file
- Sign up at: <https://app.tavily.com/home>
- **InfoQuest** (recommended): An intelligent search and crawling toolset independently developed by BytePlus and optimized for AI applications
- Requires `INFOQUEST_API_KEY` to be set in the `.env` file
- Supports time range filtering and site filtering
- Provides high-quality search results and content extraction
- Sign up at: <https://console.byteplus.com/infoquest/infoquests>
- Visit <https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest> to learn more
- **DuckDuckGo**: Privacy-focused search engine
- No API key required
@@ -170,10 +179,32 @@ DeerFlow supports multiple search engines, configurable via `SEARCH_API` in the `.env` file
To configure your preferred search engine, set the `SEARCH_API` variable in the `.env` file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
# Choose one: tavily, infoquest, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```
### Crawling Tools
- **Jina** (default): Freely accessible web content crawling tool
- Basic features require no API key
- Using an API key grants higher access rate limits
- Visit <https://jina.ai/reader> to learn more
- **InfoQuest** (recommended): An intelligent search and crawling toolset independently developed by BytePlus and optimized for AI applications
- Requires `INFOQUEST_API_KEY` to be set in the `.env` file
- Provides configurable crawling parameters
- Supports custom timeout settings
- Offers more powerful content extraction capabilities
- Visit <https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest> to learn more
To configure your preferred crawling tool, set the following in the `conf.yaml` file:
```yaml
CRAWLER_ENGINE:
  # Engine type: "jina" (default) or "infoquest"
  engine: infoquest
```
### Private Knowledgebase Engine
DeerFlow supports retrieval over private-domain knowledge: you can upload documents to several kinds of private knowledgebases for use during research. The currently supported private knowledgebases are:
@@ -221,8 +252,8 @@ DeerFlow supports retrieval over private-domain knowledge: you can upload documents
### Tools and MCP Integrations
- 🔍 **Search and Retrieval**
- Web search via Tavily, Brave Search and more
- Crawling with Jina
- Web search via Tavily, InfoQuest, Brave Search and more
- Crawling with Jina and InfoQuest
- Advanced content extraction
- Support for retrieving from designated private knowledgebases

View File

@@ -61,9 +61,13 @@ BASIC_MODEL:
# # When interrupt is triggered, user will be prompted to approve/reject
# # Approved keywords: "approved", "approve", "yes", "proceed", "continue", "ok", "okay", "accepted", "accept"
# Search engine configuration (Only supports Tavily currently)
# Search engine configuration
# Supported engines: tavily, infoquest
# SEARCH_ENGINE:
# engine: tavily
# # Engine type to use: "tavily" or "infoquest"
# engine: tavily
#
# # The following parameters are specific to Tavily
# # Only include results from these domains
# include_domains:
# - example.com
@@ -88,3 +92,28 @@ BASIC_MODEL:
# min_score_threshold: 0.0
# # Maximum content length per page
# max_content_length_per_page: 4000
#
# # The following parameters are specific to InfoQuest
# # Limits search results to the specified time range; only content within that range is returned. Set to -1 to disable time filtering
# time_range: 30
# # Limits search results to the specified whitelisted domains. Set to an empty string to disable site filtering
# site: "example.com"
# Crawler engine configuration
# Supported engines: jina (default), infoquest
# Uncomment the following section to configure crawler engine
# CRAWLER_ENGINE:
# # Engine type to use: "jina" (default) or "infoquest"
# engine: infoquest
#
# # The following timeout parameters are only effective when engine is set to "infoquest"
# # Waiting time after page loading (in seconds)
# # Set to positive value to enable, -1 to disable
# fetch_time: 10
# # Overall timeout for the entire crawling process (in seconds)
# # Set to positive value to enable, -1 to disable
# timeout: 30
# # Timeout for navigating to the page (in seconds)
# # Set to positive value to enable, -1 to disable
# navi_timeout: 15
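The "-1 disables" convention for the three timeout parameters above can be illustrated with a small helper. This is a hypothetical standalone sketch, not code from the project; the key names mirror the `conf.yaml` entries.

```python
def resolve_timeouts(crawler_config: dict) -> dict:
    # Missing keys default to -1, i.e. the corresponding timeout is disabled
    keys = ("fetch_time", "timeout", "navi_timeout")
    return {key: crawler_config.get(key, -1) for key in keys}

def is_enabled(value: int) -> bool:
    # Positive values enable the timeout (in seconds); -1 disables it
    return value > 0

cfg = resolve_timeouts({"engine": "infoquest", "timeout": 30})
print(cfg)  # {'fetch_time': -1, 'timeout': 30, 'navi_timeout': -1}
```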

View File

@@ -11,6 +11,7 @@ load_dotenv()
class SearchEngine(enum.Enum):
    TAVILY = "tavily"
    INFOQUEST = "infoquest"
    DUCKDUCKGO = "duckduckgo"
    BRAVE_SEARCH = "brave_search"
    ARXIV = "arxiv"
@@ -18,10 +19,14 @@ class SearchEngine(enum.Enum):
    WIKIPEDIA = "wikipedia"

class CrawlerEngine(enum.Enum):
    JINA = "jina"
    INFOQUEST = "infoquest"

# Tool configuration
SELECTED_SEARCH_ENGINE = os.getenv("SEARCH_API", SearchEngine.TAVILY.value)

class RAGProvider(enum.Enum):
    DIFY = "dify"
    RAGFLOW = "ragflow"
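The `SEARCH_API` environment variable resolves against the enum above. A minimal self-contained sketch (with a trimmed-down `SearchEngine` redefined here so the snippet runs on its own) shows the lookup and the Tavily fallback:

```python
import enum
import os

class SearchEngine(enum.Enum):
    TAVILY = "tavily"
    INFOQUEST = "infoquest"
    DUCKDUCKGO = "duckduckgo"

def selected_search_engine() -> SearchEngine:
    # Falls back to Tavily when SEARCH_API is unset; raises ValueError
    # for values that are not members of the enum
    return SearchEngine(os.getenv("SEARCH_API", SearchEngine.TAVILY.value))

os.environ["SEARCH_API"] = "infoquest"
print(selected_search_engine().value)  # infoquest
```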

View File

@@ -4,9 +4,12 @@
import re
import logging
from .article import Article
from .jina_client import JinaClient
from .readability_extractor import ReadabilityExtractor
from src.config.tools import CrawlerEngine
from src.config import load_yaml_config
from src.crawler.article import Article
from src.crawler.infoquest_client import InfoQuestClient
from src.crawler.jina_client import JinaClient
from src.crawler.readability_extractor import ReadabilityExtractor
logger = logging.getLogger(__name__)
@@ -138,17 +141,21 @@ class Crawler:
        # them into text and image blocks for one single and unified
        # LLM message.
        #
        # Jina is not the best crawler on readability, however it's
        # much easier and free to use.
        # The system supports multiple crawler engines:
        # - Jina: an accessible solution, though with some limitations in readability extraction
        # - InfoQuest: a BytePlus product offering advanced capabilities with configurable
        #   parameters like fetch_time, timeout, and navi_timeout
        #
        # Instead of using Jina's own markdown converter, we'll use
        # our own solution to get better readability results.
        try:
            jina_client = JinaClient()
            html = jina_client.crawl(url, return_format="html")
        except Exception as e:
            logger.error(f"Failed to fetch URL {url} from Jina: {repr(e)}")
            raise

        # Get the crawler configuration
        config = load_yaml_config("conf.yaml")
        crawler_config = config.get("CRAWLER_ENGINE", {})

        # Get the selected crawler tool based on the configuration
        crawler_client = self._select_crawler_tool(crawler_config)
        html = self._crawl_with_tool(crawler_client, url)

        # Check if we got valid HTML content
        if not html or not html.strip():
@@ -186,3 +193,44 @@ class Crawler:
        article.url = url
        return article

    def _select_crawler_tool(self, crawler_config: dict):
        # Only check the engine from the configuration file
        engine = crawler_config.get("engine", CrawlerEngine.JINA.value)
        if engine == CrawlerEngine.JINA.value:
            logger.info("Selecting Jina crawler engine")
            return JinaClient()
        elif engine == CrawlerEngine.INFOQUEST.value:
            logger.info("Selecting InfoQuest crawler engine")
            # Read timeout parameters directly from the crawler_config root level.
            # These parameters are only effective when engine is set to "infoquest".
            fetch_time = crawler_config.get("fetch_time", -1)
            timeout = crawler_config.get("timeout", -1)
            navi_timeout = crawler_config.get("navi_timeout", -1)
            # Log the configuration being used
            if fetch_time > 0 or timeout > 0 or navi_timeout > 0:
                logger.debug(
                    f"Initializing InfoQuestCrawler with parameters: "
                    f"fetch_time={fetch_time}, "
                    f"timeout={timeout}, "
                    f"navi_timeout={navi_timeout}"
                )
            # Initialize InfoQuestClient with the parameters from the configuration
            return InfoQuestClient(
                fetch_time=fetch_time,
                timeout=timeout,
                navi_timeout=navi_timeout,
            )
        else:
            raise ValueError(f"Unsupported crawler engine: {engine}")

    def _crawl_with_tool(self, crawler_client, url: str) -> str:
        logger.info(f"Crawling URL: {url} using {crawler_client.__class__.__name__}")
        try:
            return crawler_client.crawl(url, return_format="html")
        except Exception as e:
            logger.error(
                f"Failed to fetch URL {url} using {crawler_client.__class__.__name__}: {repr(e)}"
            )
            raise
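The dispatch above relies on duck typing: any client exposing `crawl(url, return_format)` can stand in for `JinaClient` or `InfoQuestClient`. A toy sketch (the `StubClient` is purely illustrative, not part of the codebase) shows the contract:

```python
class StubClient:
    # A stand-in client honoring the crawl(url, return_format) interface
    def crawl(self, url: str, return_format: str = "html") -> str:
        return f"<html><body>{url}</body></html>"

def crawl_with_tool(crawler_client, url: str) -> str:
    # Mirrors Crawler._crawl_with_tool: delegate and let exceptions propagate
    return crawler_client.crawl(url, return_format="html")

print(crawl_with_tool(StubClient(), "https://example.com"))
# <html><body>https://example.com</body></html>
```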

View File

@@ -0,0 +1,153 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT
"""Util that calls InfoQuest Crawler API.
In order to set this up, follow instructions at:
https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest
"""
import json
import logging
import os
from typing import Dict, Any
import requests
logger = logging.getLogger(__name__)
class InfoQuestClient:
    """Client for interacting with the InfoQuest web crawling API."""

    def __init__(self, fetch_time: int = -1, timeout: int = -1, navi_timeout: int = -1):
        logger.info(
            "\n============================================\n"
            "🚀 BytePlus InfoQuest Crawler Initialization 🚀\n"
            "============================================"
        )
        self.fetch_time = fetch_time
        self.timeout = timeout
        self.navi_timeout = navi_timeout
        self.api_key_set = bool(os.getenv("INFOQUEST_API_KEY"))
        config_details = (
            f"\n📋 Configuration Details:\n"
            f"├── Fetch Timeout: {fetch_time} {'(Default: No timeout)' if fetch_time == -1 else '(Custom)'}\n"
            f"├── Timeout: {timeout} {'(Default: No timeout)' if timeout == -1 else '(Custom)'}\n"
            f"├── Navigation Timeout: {navi_timeout} {'(Default: No timeout)' if navi_timeout == -1 else '(Custom)'}\n"
            f"└── API Key: {'✅ Configured' if self.api_key_set else '❌ Not set'}"
        )
        logger.info(config_details)
        logger.info("\n" + "*" * 70 + "\n")

    def crawl(self, url: str, return_format: str = "html") -> str:
        logger.debug("Preparing request for URL: %s", url)
        # Prepare headers
        headers = self._prepare_headers()
        # Prepare request data
        data = self._prepare_request_data(url, return_format)
        # Log request details
        logger.debug(
            "InfoQuest Crawler request prepared: endpoint=https://reader.infoquest.bytepluses.com, "
            "format=%s, has_api_key=%s",
            data.get("format"),
            self.api_key_set,
        )
        logger.debug("Sending crawl request to InfoQuest API")
        try:
            response = requests.post(
                "https://reader.infoquest.bytepluses.com",
                headers=headers,
                json=data,
            )
            # Check if the status code is not 200
            if response.status_code != 200:
                error_message = f"InfoQuest API returned status {response.status_code}: {response.text}"
                logger.error(error_message)
                return f"Error: {error_message}"
            # Check for an empty response
            if not response.text or not response.text.strip():
                error_message = "InfoQuest Crawler API returned empty response"
                logger.error("BytePlus InfoQuest Crawler returned empty response for URL: %s", url)
                return f"Error: {error_message}"
            # Try to parse the response as JSON and extract reader_result
            try:
                response_data = json.loads(response.text)
                # Extract reader_result if it exists
                if "reader_result" in response_data:
                    logger.debug("Successfully extracted reader_result from JSON response")
                    return response_data["reader_result"]
                elif "content" in response_data:
                    # Fall back to the content field if reader_result is not available
                    logger.debug("Using content field as fallback")
                    return response_data["content"]
                else:
                    # If neither field exists, fall through and return the original response
                    logger.warning("Neither reader_result nor content field found in JSON response")
            except json.JSONDecodeError:
                # If the response is not JSON, return the original text
                logger.debug("Response is not in JSON format, returning as-is")
            # Log a partial response for debugging
            if logger.isEnabledFor(logging.DEBUG):
                response_sample = response.text[:200] + ("..." if len(response.text) > 200 else "")
                logger.debug(
                    "Successfully received response, content length: %d bytes, first 200 chars: %s",
                    len(response.text),
                    response_sample,
                )
            return response.text
        except Exception as e:
            error_message = f"Request to InfoQuest API failed: {str(e)}"
            logger.error(error_message)
            return f"Error: {error_message}"

    def _prepare_headers(self) -> Dict[str, str]:
        """Prepare request headers."""
        headers = {
            "Content-Type": "application/json",
        }
        # Add the API key if available
        if os.getenv("INFOQUEST_API_KEY"):
            headers["Authorization"] = f"Bearer {os.getenv('INFOQUEST_API_KEY')}"
            logger.debug("API key added to request headers")
        else:
logger.warning(
"InfoQuest API key is not set. Provide your own key for authentication."
)
return headers
def _prepare_request_data(self, url: str, return_format: str) -> Dict[str, Any]:
"""Prepare request data with formatted parameters."""
# Normalize return_format
if return_format and return_format.lower() == "html":
normalized_format = "HTML"
else:
normalized_format = return_format
data = {"url": url, "format": normalized_format}
# Add timeout parameters if set to positive values
timeout_params = {}
if self.fetch_time > 0:
timeout_params["fetch_time"] = self.fetch_time
if self.timeout > 0:
timeout_params["timeout"] = self.timeout
if self.navi_timeout > 0:
timeout_params["navi_timeout"] = self.navi_timeout
# Log applied timeout parameters
if timeout_params:
logger.debug("Applying timeout parameters: %s", timeout_params)
data.update(timeout_params)
return data
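The payload logic above (uppercasing "html" and including only positive timeout values) can be sketched as a standalone function. `build_payload` is a hypothetical name for illustration only, not part of the client:

```python
from typing import Any, Dict

def build_payload(url: str, return_format: str = "html",
                  fetch_time: int = -1, timeout: int = -1,
                  navi_timeout: int = -1) -> Dict[str, Any]:
    """Mirror of _prepare_request_data: normalize 'html' to 'HTML', skip unset timeouts."""
    fmt = "HTML" if return_format and return_format.lower() == "html" else return_format
    data: Dict[str, Any] = {"url": url, "format": fmt}
    # -1 means "not set" and is omitted from the request body entirely.
    for key, value in (("fetch_time", fetch_time),
                       ("timeout", timeout),
                       ("navi_timeout", navi_timeout)):
        if value > 0:
            data[key] = value
    return data

# "html" in any casing becomes "HTML"; unset (-1) timeouts never reach the API.
print(build_payload("http://example.com", "Html", timeout=60))
# → {'url': 'http://example.com', 'format': 'HTML', 'timeout': 60}
```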

View File

@@ -22,12 +22,21 @@ class JinaClient:
"Jina API key is not set. Provide your own key to access a higher rate limit. See https://jina.ai/reader for more information."
)
data = {"url": url}
try:
response = requests.post("https://r.jina.ai/", headers=headers, json=data)
if response.status_code != 200:
raise ValueError(f"Jina API returned status {response.status_code}: {response.text}")
error_message = f"Jina API returned status {response.status_code}: {response.text}"
logger.error(error_message)
return f"Error: {error_message}"
if not response.text or not response.text.strip():
raise ValueError("Jina API returned empty response")
error_message = "Jina API returned empty response"
logger.error(error_message)
return f"Error: {error_message}"
return response.text
except Exception as e:
error_message = f"Request to Jina API failed: {str(e)}"
logger.error(error_message)
return f"Error: {error_message}"

View File

@@ -0,0 +1,4 @@
from .infoquest_search_api import InfoQuestAPIWrapper
from .infoquest_search_results import InfoQuestSearchResults
__all__ = ["InfoQuestAPIWrapper", "InfoQuestSearchResults"]

View File

@@ -0,0 +1,232 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT
"""Util that calls InfoQuest Search API.
In order to set this up, follow instructions at:
https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest
"""
import json
from typing import Any, Dict, List
import aiohttp
import requests
from langchain_core.utils import get_from_dict_or_env
from pydantic import BaseModel, ConfigDict, SecretStr, model_validator
from src.config import load_yaml_config
import logging
logger = logging.getLogger(__name__)
INFOQUEST_API_URL = "https://search.infoquest.bytepluses.com"
def get_search_config():
config = load_yaml_config("conf.yaml")
search_config = config.get("SEARCH_ENGINE", {})
return search_config
class InfoQuestAPIWrapper(BaseModel):
"""Wrapper for InfoQuest Search API."""
infoquest_api_key: SecretStr
model_config = ConfigDict(
extra="forbid",
)
@model_validator(mode="before")
@classmethod
def validate_environment(cls, values: Dict) -> Any:
"""Validate that api key and endpoint exists in environment."""
logger.info("Initializing BytePlus InfoQuest Product - Search API client")
infoquest_api_key = get_from_dict_or_env(
values, "infoquest_api_key", "INFOQUEST_API_KEY"
)
values["infoquest_api_key"] = infoquest_api_key
logger.info("BytePlus InfoQuest Product - Environment validation successful")
return values
def raw_results(
self,
query: str,
time_range: int,
site: str,
output_format: str = "JSON",
) -> Dict:
"""Get results from the InfoQuest Search API synchronously."""
if logger.isEnabledFor(logging.DEBUG):
query_truncated = query[:50] + "..." if len(query) > 50 else query
logger.debug(
f"InfoQuest - Search API request initiated | "
f"operation=search | "
f"query_truncated={query_truncated} | "
f"has_time_filter={time_range > 0} | "
f"has_site_filter={bool(site)} | "
f"request_type=sync"
)
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {self.infoquest_api_key.get_secret_value()}",
}
params = {
"format": output_format,
"query": query
}
if time_range > 0:
params["time_range"] = time_range
logger.debug(f"InfoQuest - Applying time range filter: time_range_days={time_range}")
if site != "":
params["site"] = site
logger.debug(f"InfoQuest - Applying site filter: site={site}")
response = requests.post(
f"{INFOQUEST_API_URL}",
headers=headers,
json=params
)
response.raise_for_status()
# Print partial response for debugging
response_json = response.json()
if logger.isEnabledFor(logging.DEBUG):
response_sample = json.dumps(response_json)[:200] + ("..." if len(json.dumps(response_json)) > 200 else "")
logger.debug(
f"Search API request completed successfully | "
f"service=InfoQuest | "
f"status=success | "
f"response_sample={response_sample}"
)
return response_json["search_result"]
async def raw_results_async(
self,
query: str,
time_range: int,
site: str,
output_format: str = "JSON",
) -> Dict:
"""Get results from the InfoQuest Search API asynchronously."""
if logger.isEnabledFor(logging.DEBUG):
query_truncated = query[:50] + "..." if len(query) > 50 else query
logger.debug(
f"BytePlus InfoQuest - Search API async request initiated | "
f"operation=search | "
f"query_truncated={query_truncated} | "
f"has_time_filter={time_range > 0} | "
f"has_site_filter={bool(site)} | "
f"request_type=async"
)
# Function to perform the API call
async def fetch() -> str:
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {self.infoquest_api_key.get_secret_value()}",
}
params = {
"format": output_format,
"query": query,
}
if time_range > 0:
params["time_range"] = time_range
logger.debug(f"Applying time range filter in async request: {time_range} days")
if site != "":
params["site"] = site
logger.debug(f"Applying site filter in async request: {site}")
async with aiohttp.ClientSession(trust_env=True) as session:
async with session.post(f"{INFOQUEST_API_URL}", headers=headers, json=params) as res:
if res.status == 200:
data = await res.text()
return data
else:
raise Exception(f"Error {res.status}: {res.reason}")
results_json_str = await fetch()
# Print partial response for debugging
if logger.isEnabledFor(logging.DEBUG):
response_sample = results_json_str[:200] + ("..." if len(results_json_str) > 200 else "")
logger.debug(
f"Async search API request completed successfully | "
f"service=InfoQuest | "
f"status=success | "
f"response_sample={response_sample}"
)
return json.loads(results_json_str)["search_result"]
def clean_results_with_images(
self, raw_results: List[Dict[str, Dict[str, Dict[str, Any]]]]
) -> List[Dict]:
"""Clean results from InfoQuest Search API."""
logger.debug("Processing search results")
seen_urls = set()
clean_results = []
counts = {"pages": 0, "news": 0, "images": 0}
for content_list in raw_results:
content = content_list["content"]
results = content["results"]
if results.get("organic"):
organic_results = results["organic"]
for result in organic_results:
clean_result = {
"type": "page",
"title": result["title"],
"url": result["url"],
"desc": result["desc"],
}
url = clean_result["url"]
if isinstance(url, str) and url and url not in seen_urls:
seen_urls.add(url)
clean_results.append(clean_result)
counts["pages"] += 1
if results.get("top_stories"):
news = results["top_stories"]
for obj in news["items"]:
clean_result = {
"type": "news",
"time_frame": obj["time_frame"],
"title": obj["title"],
"url": obj["url"],
"source": obj["source"],
}
url = clean_result["url"]
if isinstance(url, str) and url and url not in seen_urls:
seen_urls.add(url)
clean_results.append(clean_result)
counts["news"] += 1
if results.get("images"):
images = results["images"]
for image in images["items"]:
clean_result = {
"type": "image_url",
"image_url": image["url"],
"image_description": image["alt"],
}
url = clean_result["image_url"]
if isinstance(url, str) and url and url not in seen_urls:
seen_urls.add(url)
clean_results.append(clean_result)
counts["images"] += 1
logger.debug(
f"Results processing completed | "
f"total_results={len(clean_results)} | "
f"pages={counts['pages']} | "
f"news_items={counts['news']} | "
f"images={counts['images']} | "
f"unique_urls={len(seen_urls)}"
)
return clean_results
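The URL-deduplication pattern in `clean_results_with_images` can be exercised on a tiny fixture. This is a simplified sketch covering only the organic-results branch, with made-up field values:

```python
def dedupe_organic(raw_results):
    """Simplified organic-results branch: keep the first occurrence of each URL."""
    seen, out = set(), []
    for block in raw_results:
        for result in block["content"]["results"].get("organic", []):
            url = result["url"]
            if isinstance(url, str) and url and url not in seen:
                seen.add(url)
                out.append({"type": "page", "title": result["title"],
                            "url": url, "desc": result["desc"]})
    return out

fixture = [{"content": {"results": {"organic": [
    {"title": "A", "url": "https://a.example", "desc": "first"},
    {"title": "A again", "url": "https://a.example", "desc": "duplicate, dropped"},
    {"title": "B", "url": "https://b.example", "desc": "second"},
]}}}]
print([r["url"] for r in dedupe_organic(fixture)])
# → ['https://a.example', 'https://b.example']
```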

View File

@@ -0,0 +1,236 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT
"""Tool for the InfoQuest search API."""
import json
import logging
from typing import Any, Dict, List, Literal, Optional, Tuple, Type, Union
from langchain_core.callbacks import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
from langchain_core.tools import BaseTool
from pydantic import BaseModel, Field
from src.tools.infoquest_search.infoquest_search_api import InfoQuestAPIWrapper
logger = logging.getLogger(__name__)
class InfoQuestInput(BaseModel):
"""Input for the InfoQuest tool."""
query: str = Field(description="search query to look up")
class InfoQuestSearchResults(BaseTool):
"""Tool that queries the InfoQuest Search API and returns processed results with images.
Setup:
Install required packages and set environment variable ``INFOQUEST_API_KEY``.
.. code-block:: bash
pip install -U langchain-community aiohttp
export INFOQUEST_API_KEY="your-api-key"
Instantiate:
.. code-block:: python
from src.tools.infoquest_search import InfoQuestSearchResults
tool = InfoQuestSearchResults(
output_format="json",
time_range=10,
site="nytimes.com"
)
Invoke directly with args:
.. code-block:: python
tool.invoke({
'query': 'who won the last french open'
})
.. code-block:: json
[
{
"type": "page",
"title": "Djokovic Claims French Open Title...",
"url": "https://www.nytimes.com/...",
"desc": "Novak Djokovic won the 2024 French Open by defeating Casper Ruud..."
},
{
"type": "news",
"time_frame": "2 days ago",
"title": "French Open Finals Recap",
"url": "https://www.nytimes.com/...",
"source": "New York Times"
},
{
"type": "image_url",
"image_url": {"url": "https://www.nytimes.com/.../djokovic.jpg"},
"image_description": "Novak Djokovic celebrating his French Open victory"
}
]
Invoke with tool call:
.. code-block:: python
tool.invoke({
"args": {
'query': 'who won the last french open',
},
"type": "tool_call",
"id": "foo",
"name": "infoquest"
})
.. code-block:: python
ToolMessage(
content='[
{"type": "page", "title": "Djokovic Claims...", "url": "https://www.nytimes.com/...", "desc": "Novak Djokovic won..."},
{"type": "news", "time_frame": "2 days ago", "title": "French Open Finals...", "url": "https://www.nytimes.com/...", "source": "New York Times"},
{"type": "image_url", "image_url": {"url": "https://www.nytimes.com/.../djokovic.jpg"}, "image_description": "Novak Djokovic celebrating..."}
]',
tool_call_id='foo',
name='infoquest_search_results_json',
)
""" # noqa: E501
name: str = "infoquest_search_results_json"
description: str = (
"A search engine optimized for comprehensive, accurate, and trusted results. "
"Useful for when you need to answer questions about current events. "
"Input should be a search query."
)
args_schema: Type[BaseModel] = InfoQuestInput
"""The tool response format."""
time_range: int = -1
"""Time range for filtering search results, in days.
If set to a positive integer (e.g., 30), only results from the last N days will be included.
Default is -1, which means no time range filter is applied.
"""
site: str = ""
"""Specific domain to restrict search results to (e.g., "nytimes.com").
If provided, only results from the specified domain will be returned.
Default is an empty string, which means no domain restriction is applied.
"""
api_wrapper: InfoQuestAPIWrapper = Field(default_factory=InfoQuestAPIWrapper) # type: ignore[arg-type]
response_format: Literal["content_and_artifact"] = "content_and_artifact"
def __init__(self, **kwargs: Any) -> None:
# Create api_wrapper with infoquest_api_key if provided
if "infoquest_api_key" in kwargs:
kwargs["api_wrapper"] = InfoQuestAPIWrapper(
infoquest_api_key=kwargs["infoquest_api_key"]
)
logger.debug("API wrapper initialized with provided key")
super().__init__(**kwargs)
logger.info(
"\n============================================\n"
"🚀 BytePlus InfoQuest Search Initialization 🚀\n"
"============================================"
)
# Prepare initialization details
time_range_status = f"{self.time_range} days" if hasattr(self, 'time_range') and self.time_range > 0 else "Disabled"
site_filter = f"'{self.site}'" if hasattr(self, 'site') and self.site else "Disabled"
initialization_details = (
f"\n🔧 Tool Information:\n"
f"├── Tool Name: {self.name}\n"
f"├── Time Range Filter: {time_range_status}\n"
f"└── Site Filter: {site_filter}\n"
f"📊 Configuration Summary:\n"
f"├── Response Format: {self.response_format}\n"
)
logger.info(initialization_details)
logger.info("\n" + "*" * 70 + "\n")
def _run(
self,
query: str,
run_manager: Optional[CallbackManagerForToolRun] = None,
) -> Tuple[Union[List[Dict[str, str]], str], Dict]:
"""Use the tool."""
try:
logger.debug(f"Executing search with parameters: time_range={self.time_range}, site={self.site}")
raw_results = self.api_wrapper.raw_results(
query,
self.time_range,
self.site
)
logger.debug("Processing raw search results")
cleaned_results = self.api_wrapper.clean_results_with_images(raw_results["results"])
result_json = json.dumps(cleaned_results, ensure_ascii=False)
logger.info(
f"Search tool execution completed | "
f"mode=synchronous | "
f"results_count={len(cleaned_results)}"
)
return result_json, raw_results
except Exception as e:
logger.error(
f"Search tool execution failed | "
f"mode=synchronous | "
f"error={str(e)}"
)
error_result = json.dumps({"error": repr(e)}, ensure_ascii=False)
return error_result, {}
async def _arun(
self,
query: str,
run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
) -> Tuple[Union[List[Dict[str, str]], str], Dict]:
"""Use the tool asynchronously."""
if logger.isEnabledFor(logging.DEBUG):
query_truncated = query[:50] + "..." if len(query) > 50 else query
logger.debug(
f"Search tool execution started | "
f"mode=asynchronous | "
f"query={query_truncated}"
)
try:
logger.debug(f"Executing async search with parameters: time_range={self.time_range}, site={self.site}")
raw_results = await self.api_wrapper.raw_results_async(
query,
self.time_range,
self.site
)
logger.debug("Processing raw async search results")
cleaned_results = self.api_wrapper.clean_results_with_images(raw_results["results"])
result_json = json.dumps(cleaned_results, ensure_ascii=False)
logger.debug(
f"Search tool execution completed | "
f"mode=asynchronous | "
f"results_count={len(cleaned_results)}"
)
return result_json, raw_results
except Exception as e:
logger.error(
f"Search tool execution failed | "
f"mode=asynchronous | "
f"error={str(e)}"
)
error_result = json.dumps({"error": repr(e)}, ensure_ascii=False)
return error_result, {}
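Both `_run` and `_arun` follow LangChain's `content_and_artifact` convention: they return a `(content, artifact)` tuple where `content` is a JSON string for the model and `artifact` is the raw payload, and errors are reported as content rather than raised. A minimal sketch of that shape, independent of the real API wrapper (`run_search` and its `fetch` callable are illustrative names):

```python
import json

def run_search(fetch):
    """Return (content_json, raw_artifact); on any failure, (error_json, {})."""
    try:
        raw = fetch()
        cleaned = [{"type": "page", "url": r["url"]} for r in raw["results"]]
        return json.dumps(cleaned, ensure_ascii=False), raw
    except Exception as e:
        # Mirror the tool: never raise, surface the error as the content string.
        return json.dumps({"error": repr(e)}, ensure_ascii=False), {}

ok = run_search(lambda: {"results": [{"url": "https://a.example"}]})
bad = run_search(lambda: (_ for _ in ()).throw(ValueError("boom")))
```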

View File

@@ -21,6 +21,7 @@ from langchain_community.utilities import (
from src.config import SELECTED_SEARCH_ENGINE, SearchEngine, load_yaml_config
from src.tools.decorators import create_logged_tool
from src.tools.infoquest_search.infoquest_search_results import InfoQuestSearchResults
from src.tools.tavily_search.tavily_search_results_with_images import (
TavilySearchWithImages,
)
@@ -29,6 +30,7 @@ logger = logging.getLogger(__name__)
# Create logged versions of the search tools
LoggedTavilySearch = create_logged_tool(TavilySearchWithImages)
LoggedInfoQuestSearch = create_logged_tool(InfoQuestSearchResults)
LoggedDuckDuckGoSearch = create_logged_tool(DuckDuckGoSearchResults)
LoggedBraveSearch = create_logged_tool(BraveSearch)
LoggedArxivSearch = create_logged_tool(ArxivQueryRun)
@@ -76,6 +78,17 @@ def get_web_search_tool(max_search_results: int):
include_domains=include_domains,
exclude_domains=exclude_domains,
)
elif SELECTED_SEARCH_ENGINE == SearchEngine.INFOQUEST.value:
time_range = search_config.get("time_range", -1)
site = search_config.get("site", "")
logger.info(
f"InfoQuest search configuration loaded: time_range={time_range}, site={site}"
)
return LoggedInfoQuestSearch(
name="web_search",
time_range=time_range,
site=site,
)
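The InfoQuest branch above reads two optional keys from the `SEARCH_ENGINE` config section, with defaults matching the tool's attributes (`time_range=-1`, `site=""` both mean "disabled"). A sketch of that lookup; the helper name is ours, the key names follow the code above:

```python
def infoquest_params(search_config: dict) -> tuple:
    """Read the two optional InfoQuest keys with the tool's own defaults."""
    time_range = search_config.get("time_range", -1)  # days; -1 disables the filter
    site = search_config.get("site", "")              # '' disables the domain restriction
    return time_range, site

print(infoquest_params({"time_range": 7, "site": "nytimes.com"}))  # → (7, 'nytimes.com')
print(infoquest_params({}))  # → (-1, '')
```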
elif SELECTED_SEARCH_ENGINE == SearchEngine.DUCKDUCKGO.value:
return LoggedDuckDuckGoSearch(
name="web_search",

View File

@@ -3,6 +3,7 @@
import src.crawler as crawler_module
from src.crawler.crawler import safe_truncate
from src.crawler.infoquest_client import InfoQuestClient
def test_crawler_sets_article_url(monkeypatch):
@@ -19,16 +20,28 @@ def test_crawler_sets_article_url(monkeypatch):
def crawl(self, url, return_format=None):
return "<html>dummy</html>"
class DummyInfoQuestClient:
def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
pass
def crawl(self, url, return_format=None):
return "<html>dummy</html>"
class DummyReadabilityExtractor:
def extract_article(self, html):
return DummyArticle()
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "jina"}}
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.Crawler()
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
assert article.url == url
@@ -44,6 +57,16 @@ def test_crawler_calls_dependencies(monkeypatch):
calls["jina"] = (url, return_format)
return "<html>dummy</html>"
# Fix: Update DummyInfoQuestClient to accept initialization parameters
class DummyInfoQuestClient:
def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
# We don't need to use these parameters, just accept them
pass
def crawl(self, url, return_format=None):
calls["infoquest"] = (url, return_format)
return "<html>dummy</html>"
class DummyReadabilityExtractor:
def extract_article(self, html):
calls["extractor"] = html
@@ -56,12 +79,16 @@ def test_crawler_calls_dependencies(monkeypatch):
return DummyArticle()
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
# Add mock for load_yaml_config to ensure it returns configuration with Jina engine
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "jina"}}
crawler = crawler_module.Crawler()
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient) # Include this if InfoQuest might be used
monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
crawler.crawl(url)
assert "jina" in calls
@@ -92,16 +119,61 @@ def test_crawler_handles_empty_content(monkeypatch):
# This should not be called for empty content
assert False, "ReadabilityExtractor should not be called for empty content"
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "jina"}}
crawler = crawler_module.Crawler()
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
assert article.url == url
assert article.title == "Empty Content"
assert "No content could be extracted" in article.html_content
assert "No content could be extracted from this page" in article.html_content
def test_crawler_handles_error_response_from_client(monkeypatch):
"""Test that the crawler handles error responses from the client gracefully."""
class DummyArticle:
def __init__(self, title, html_content):
self.title = title
self.html_content = html_content
self.url = None
def to_markdown(self):
return f"# {self.title}"
class DummyJinaClient:
def crawl(self, url, return_format=None):
return "Error: API returned status 500"
class DummyReadabilityExtractor:
def extract_article(self, html):
# This should not be called for error responses
assert False, "ReadabilityExtractor should not be called for error responses"
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "jina"}}
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
assert article.url == url
assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
assert "Error: API returned status 500" in article.html_content
def test_crawler_handles_non_html_content(monkeypatch):
@@ -125,16 +197,22 @@ def test_crawler_handles_non_html_content(monkeypatch):
# This should not be called for non-HTML content
assert False, "ReadabilityExtractor should not be called for non-HTML content"
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "jina"}}
crawler = crawler_module.Crawler()
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
assert article.url == url
assert article.title == "Non-HTML Content"
assert "cannot be parsed as HTML" in article.html_content
assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
assert "cannot be parsed as HTML" in article.html_content or "Content extraction failed" in article.html_content
assert "plain text content" in article.html_content # Should include a snippet of the original content
@@ -158,10 +236,16 @@ def test_crawler_handles_extraction_failure(monkeypatch):
def extract_article(self, html):
raise Exception("Extraction failed")
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "jina"}}
crawler = crawler_module.Crawler()
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
@@ -192,16 +276,22 @@ def test_crawler_with_json_like_content(monkeypatch):
# This should not be called for JSON content
assert False, "ReadabilityExtractor should not be called for JSON content"
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "jina"}}
crawler = crawler_module.Crawler()
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com/api/data"
article = crawler.crawl(url)
assert article.url == url
assert article.title == "Non-HTML Content"
assert "cannot be parsed as HTML" in article.html_content
assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
assert "cannot be parsed as HTML" in article.html_content or "Content extraction failed" in article.html_content
assert '{"title": "Some JSON"' in article.html_content # Should include a snippet of the JSON
@@ -241,6 +331,9 @@ def test_crawler_with_various_html_formats(monkeypatch):
def extract_article(self, html):
return DummyArticle("Extracted Article", "<p>Extracted content</p>")
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "jina"}}
# Test each HTML format
test_cases = [
(DummyJinaClient1, "HTML with DOCTYPE"),
@@ -252,8 +345,9 @@ def test_crawler_with_various_html_formats(monkeypatch):
for JinaClientClass, description in test_cases:
monkeypatch.setattr("src.crawler.crawler.JinaClient", JinaClientClass)
monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.Crawler()
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
@@ -298,3 +392,284 @@ def test_safe_truncate_function():
assert len(result) <= 10
# Verify it's valid UTF-8
assert result.encode('utf-8').decode('utf-8') == result
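The assertions above pin down `safe_truncate`'s contract: the result fits the byte budget and is still valid UTF-8. One common way to satisfy that contract, as a sketch under our own naming (this is not necessarily the project's implementation):

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a multi-byte character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" silently drops any trailing partial character left by the byte cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

s = "héllo wörld"  # 'é' and 'ö' are 2-byte characters in UTF-8
out = truncate_utf8(s, 6)
assert len(out.encode("utf-8")) <= 6
assert out.encode("utf-8").decode("utf-8") == out  # round-trips: valid UTF-8
```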
# ========== InfoQuest Client Tests ==========
def test_crawler_selects_infoquest_engine(monkeypatch):
"""Test that the crawler selects InfoQuestClient when configured to use it."""
calls = {}
class DummyJinaClient:
def crawl(self, url, return_format=None):
calls["jina"] = True
return "<html>dummy</html>"
class DummyInfoQuestClient:
def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
calls["infoquest_init"] = (fetch_time, timeout, navi_timeout)
def crawl(self, url, return_format=None):
calls["infoquest"] = (url, return_format)
return "<html>dummy from infoquest</html>"
class DummyReadabilityExtractor:
def extract_article(self, html):
calls["extractor"] = html
class DummyArticle:
url = None
def to_markdown(self):
return "# Dummy"
return DummyArticle()
# Mock configuration to use InfoQuest engine with custom parameters
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {
"engine": "infoquest",
"fetch_time": 30,
"timeout": 60,
"navi_timeout": 45
}}
monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
crawler.crawl(url)
# Verify InfoQuestClient was used, not JinaClient
assert "infoquest_init" in calls
assert calls["infoquest_init"] == (30, 60, 45) # Verify parameters were passed correctly
assert "infoquest" in calls
assert calls["infoquest"][0] == url
assert calls["infoquest"][1] == "html"
assert "extractor" in calls
assert calls["extractor"] == "<html>dummy from infoquest</html>"
assert "jina" not in calls
def test_crawler_with_infoquest_empty_content(monkeypatch):
"""Test that the crawler handles empty content from InfoQuest client gracefully."""
class DummyArticle:
def __init__(self, title, html_content):
self.title = title
self.html_content = html_content
self.url = None
def to_markdown(self):
return f"# {self.title}"
class DummyInfoQuestClient:
def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
pass
def crawl(self, url, return_format=None):
return "" # Empty content
class DummyReadabilityExtractor:
def extract_article(self, html):
# This should not be called for empty content
assert False, "ReadabilityExtractor should not be called for empty content"
# Mock configuration to use InfoQuest engine
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "infoquest"}}
monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
assert article.url == url
assert article.title == "Empty Content"
assert "No content could be extracted from this page" in article.html_content
def test_crawler_with_infoquest_non_html_content(monkeypatch):
"""Test that the crawler handles non-HTML content from InfoQuest client gracefully."""
class DummyArticle:
def __init__(self, title, html_content):
self.title = title
self.html_content = html_content
self.url = None
def to_markdown(self):
return f"# {self.title}"
class DummyInfoQuestClient:
def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
pass
def crawl(self, url, return_format=None):
return "This is plain text content from InfoQuest, not HTML"
class DummyReadabilityExtractor:
def extract_article(self, html):
# This should not be called for non-HTML content
assert False, "ReadabilityExtractor should not be called for non-HTML content"
# Mock configuration to use InfoQuest engine
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "infoquest"}}
monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
assert article.url == url
assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
assert "cannot be parsed as HTML" in article.html_content or "Content extraction failed" in article.html_content
assert "plain text content from InfoQuest" in article.html_content
def test_crawler_with_infoquest_error_response(monkeypatch):
"""Test that the crawler handles error responses from InfoQuest client gracefully."""
class DummyArticle:
def __init__(self, title, html_content):
self.title = title
self.html_content = html_content
self.url = None
def to_markdown(self):
return f"# {self.title}"
class DummyInfoQuestClient:
def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
pass
def crawl(self, url, return_format=None):
return "Error: InfoQuest API returned status 403: Forbidden"
class DummyReadabilityExtractor:
def extract_article(self, html):
# This should not be called for error responses
assert False, "ReadabilityExtractor should not be called for error responses"
# Mock configuration to use InfoQuest engine
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "infoquest"}}
monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
assert article.url == url
assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
assert "Error: InfoQuest API returned status 403: Forbidden" in article.html_content
def test_crawler_with_infoquest_json_response(monkeypatch):
"""Test that the crawler handles JSON responses from InfoQuest client correctly."""
class DummyArticle:
def __init__(self, title, html_content):
self.title = title
self.html_content = html_content
self.url = None
def to_markdown(self):
return f"# {self.title}"
class DummyInfoQuestClient:
def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
pass
def crawl(self, url, return_format=None):
return "<html><body>Content from InfoQuest JSON</body></html>"
class DummyReadabilityExtractor:
def extract_article(self, html):
return DummyArticle("Extracted from JSON", html)
# Mock configuration to use InfoQuest engine
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "infoquest"}}
monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
monkeypatch.setattr(
"src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
url = "http://example.com"
article = crawler.crawl(url)
assert article.url == url
assert article.title == "Extracted from JSON"
assert "Content from InfoQuest JSON" in article.html_content
def test_infoquest_client_initialization_params():
"""Test that InfoQuestClient correctly initializes with the provided parameters."""
# Test default parameters
client_default = InfoQuestClient()
assert client_default.fetch_time == -1
assert client_default.timeout == -1
assert client_default.navi_timeout == -1
# Test custom parameters
client_custom = InfoQuestClient(fetch_time=30, timeout=60, navi_timeout=45)
assert client_custom.fetch_time == 30
assert client_custom.timeout == 60
assert client_custom.navi_timeout == 45
def test_crawler_with_infoquest_default_parameters(monkeypatch):
"""Test that the crawler initializes InfoQuestClient with default parameters when none are provided."""
calls = {}
class DummyInfoQuestClient:
def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
calls["infoquest_init"] = (fetch_time, timeout, navi_timeout)
def crawl(self, url, return_format=None):
return "<html>dummy</html>"
class DummyReadabilityExtractor:
def extract_article(self, html):
class DummyArticle:
url = None
def to_markdown(self):
return "# Dummy"
return DummyArticle()
# Mock configuration to use InfoQuest engine without custom parameters
def mock_load_config(*args, **kwargs):
return {"CRAWLER_ENGINE": {"engine": "infoquest"}}
monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
crawler = crawler_module.crawler.Crawler()
crawler.crawl("http://example.com")
# Verify default parameters were passed
assert "infoquest_init" in calls
assert calls["infoquest_init"] == (-1, -1, -1)
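Taken together, the crawler tests above pin down a simple routing contract for InfoQuest output: error strings and non-HTML text must bypass ReadabilityExtractor and become fallback articles, while HTML is handed to the extractor. A minimal sketch of that contract — the helper name and the exact fallback titles are assumptions drawn from the test assertions, not the actual `src.crawler.crawler` code:

```python
def route_infoquest_response(text: str):
    """Return a (title, html_content) fallback pair, or None to signal
    that the body looks like HTML and should go through
    ReadabilityExtractor. Names/titles mirror the test assertions."""
    stripped = text.strip()
    if stripped.startswith("Error:"):
        # e.g. "Error: InfoQuest API returned status 403: Forbidden"
        return ("Content Extraction Failed", text)
    if not stripped.startswith("<"):
        # Plain text (e.g. markdown) that cannot be parsed as HTML
        return ("Non-HTML Content",
                f"This content cannot be parsed as HTML: {text}")
    return None  # HTML-like body: extract normally
```

The monkeypatched `DummyReadabilityExtractor` instances above assert exactly this: they may only be reached on the `None` branch.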

View File

@@ -0,0 +1,230 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT
from unittest.mock import Mock, patch
import json
from src.crawler.infoquest_client import InfoQuestClient
class TestInfoQuestClient:
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_success(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html><body>Test Content</body></html>"
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result == "<html><body>Test Content</body></html>"
mock_post.assert_called_once()
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_json_response_with_reader_result(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
json_data = {
"reader_result": "<p>Extracted content from JSON</p>",
"err_code": 0,
"err_msg": "success"
}
mock_response.text = json.dumps(json_data)
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result == "<p>Extracted content from JSON</p>"
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_json_response_with_content_fallback(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
json_data = {
"content": "<p>Content fallback from JSON</p>",
"err_code": 0,
"err_msg": "success"
}
mock_response.text = json.dumps(json_data)
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result == "<p>Content fallback from JSON</p>"
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_json_response_without_expected_fields(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
json_data = {
"unexpected_field": "some value",
"err_code": 0,
"err_msg": "success"
}
mock_response.text = json.dumps(json_data)
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result == json.dumps(json_data)
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_http_error(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 500
mock_response.text = "Internal Server Error"
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result.startswith("Error:")
assert "status 500" in result
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_empty_response(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = ""
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result.startswith("Error:")
assert "empty response" in result
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_whitespace_only_response(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = " \n \t "
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result.startswith("Error:")
assert "empty response" in result
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_not_found(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 404
mock_response.text = "Not Found"
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result.startswith("Error:")
assert "status 404" in result
@patch.dict("os.environ", {}, clear=True)
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_without_api_key_logs_warning(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html>Test</html>"
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
        # Assert: crawl still succeeds without an API key; the missing
        # key is only expected to produce a log warning, not an error
        assert result == "<html>Test</html>"
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_with_timeout_parameters(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "<html>Test</html>"
mock_post.return_value = mock_response
client = InfoQuestClient(fetch_time=10, timeout=20, navi_timeout=30)
# Act
result = client.crawl("https://example.com")
# Assert
assert result == "<html>Test</html>"
# Verify the post call was made with timeout parameters
call_args = mock_post.call_args[1]
assert call_args['json']['fetch_time'] == 10
assert call_args['json']['timeout'] == 20
assert call_args['json']['navi_timeout'] == 30
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_with_markdown_format(self, mock_post):
# Arrange
mock_response = Mock()
mock_response.status_code = 200
mock_response.text = "# Markdown Content"
mock_post.return_value = mock_response
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com", return_format="markdown")
# Assert
assert result == "# Markdown Content"
# Verify the format was set correctly
call_args = mock_post.call_args[1]
assert call_args['json']['format'] == "markdown"
@patch("src.crawler.infoquest_client.requests.post")
def test_crawl_exception_handling(self, mock_post):
# Arrange
mock_post.side_effect = Exception("Network error")
client = InfoQuestClient()
# Act
result = client.crawl("https://example.com")
# Assert
assert result.startswith("Error:")
assert "Network error" in result

View File

@@ -36,11 +36,12 @@ class TestJinaClient:
         client = JinaClient()

-        # Act & Assert
-        with pytest.raises(ValueError) as exc_info:
-            client.crawl("https://example.com")
+        # Act
+        result = client.crawl("https://example.com")

-        assert "status 500" in str(exc_info.value)
+        # Assert
+        assert result.startswith("Error:")
+        assert "status 500" in result

     @patch("src.crawler.jina_client.requests.post")
     def test_crawl_empty_response(self, mock_post):
@@ -52,11 +53,12 @@ class TestJinaClient:
         client = JinaClient()

-        # Act & Assert
-        with pytest.raises(ValueError) as exc_info:
-            client.crawl("https://example.com")
+        # Act
+        result = client.crawl("https://example.com")

-        assert "empty response" in str(exc_info.value)
+        # Assert
+        assert result.startswith("Error:")
+        assert "empty response" in result

     @patch("src.crawler.jina_client.requests.post")
     def test_crawl_whitespace_only_response(self, mock_post):
@@ -68,11 +70,12 @@ class TestJinaClient:
         client = JinaClient()

-        # Act & Assert
-        with pytest.raises(ValueError) as exc_info:
-            client.crawl("https://example.com")
+        # Act
+        result = client.crawl("https://example.com")

-        assert "empty response" in str(exc_info.value)
+        # Assert
+        assert result.startswith("Error:")
+        assert "empty response" in result

     @patch("src.crawler.jina_client.requests.post")
     def test_crawl_not_found(self, mock_post):
@@ -84,11 +87,12 @@ class TestJinaClient:
         client = JinaClient()

-        # Act & Assert
-        with pytest.raises(ValueError) as exc_info:
-            client.crawl("https://example.com")
+        # Act
+        result = client.crawl("https://example.com")

-        assert "status 404" in str(exc_info.value)
+        # Assert
+        assert result.startswith("Error:")
+        assert "status 404" in result

     @patch.dict("os.environ", {}, clear=True)
     @patch("src.crawler.jina_client.requests.post")
@@ -106,3 +110,17 @@ class TestJinaClient:
         # Assert
         assert result == "<html>Test</html>"
+
+    @patch("src.crawler.jina_client.requests.post")
+    def test_crawl_exception_handling(self, mock_post):
+        # Arrange
+        mock_post.side_effect = Exception("Network error")
+
+        client = JinaClient()
+
+        # Act
+        result = client.crawl("https://example.com")
+
+        # Assert
+        assert result.startswith("Error:")
+        assert "Network error" in result

View File

@@ -0,0 +1,218 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT
from unittest.mock import Mock, patch
import pytest
import requests
from src.tools.infoquest_search.infoquest_search_api import InfoQuestAPIWrapper
class TestInfoQuestAPIWrapper:
@pytest.fixture
def wrapper(self):
# Create a wrapper instance with mock API key
return InfoQuestAPIWrapper(infoquest_api_key="dummy-key")
@pytest.fixture
def mock_response_data(self):
# Mock search result data
return {
"search_result": {
"results": [
{
"content": {
"results": {
"organic": [
{
"title": "Test Title",
"url": "https://example.com",
"desc": "Test description"
}
],
"top_stories": {
"items": [
{
"time_frame": "2 days ago",
"title": "Test News",
"url": "https://example.com/news",
"source": "Test Source"
}
]
},
"images": {
"items": [
{
"url": "https://example.com/image.jpg",
"alt": "Test image description"
}
]
}
}
}
}
]
}
}
@patch("src.tools.infoquest_search.infoquest_search_api.requests.post")
def test_raw_results_success(self, mock_post, wrapper, mock_response_data):
# Test successful synchronous search results
mock_response = Mock()
mock_response.json.return_value = mock_response_data
mock_response.raise_for_status.return_value = None
mock_post.return_value = mock_response
result = wrapper.raw_results("test query", time_range=0, site="")
assert result == mock_response_data["search_result"]
mock_post.assert_called_once()
call_args = mock_post.call_args
assert "json" in call_args.kwargs
assert call_args.kwargs["json"]["query"] == "test query"
assert "time_range" not in call_args.kwargs["json"]
assert "site" not in call_args.kwargs["json"]
@patch("src.tools.infoquest_search.infoquest_search_api.requests.post")
def test_raw_results_with_time_range_and_site(self, mock_post, wrapper, mock_response_data):
# Test search with time range and site filtering
mock_response = Mock()
mock_response.json.return_value = mock_response_data
mock_response.raise_for_status.return_value = None
mock_post.return_value = mock_response
result = wrapper.raw_results("test query", time_range=30, site="example.com")
assert result == mock_response_data["search_result"]
call_args = mock_post.call_args
params = call_args.kwargs["json"]
assert params["time_range"] == 30
assert params["site"] == "example.com"
@patch("src.tools.infoquest_search.infoquest_search_api.requests.post")
def test_raw_results_http_error(self, mock_post, wrapper):
# Test HTTP error handling
mock_response = Mock()
mock_response.raise_for_status.side_effect = requests.HTTPError("API Error")
mock_post.return_value = mock_response
with pytest.raises(requests.HTTPError):
wrapper.raw_results("test query", time_range=0, site="")
# Check if pytest-asyncio is available, otherwise mark for conditional skipping
try:
import pytest_asyncio
_asyncio_available = True
except ImportError:
_asyncio_available = False
@pytest.mark.asyncio
async def test_raw_results_async_success(self, wrapper, mock_response_data):
# Skip only if pytest-asyncio is not installed
if not self._asyncio_available:
pytest.skip("pytest-asyncio is not installed")
        # Stub the coroutine under test so no network call is made,
        # and restore the original method afterwards.
        original_method = InfoQuestAPIWrapper.raw_results_async
        async def mock_raw_results_async(self, query, time_range=0, site="", output_format="json"):
            return mock_response_data["search_result"]
        InfoQuestAPIWrapper.raw_results_async = mock_raw_results_async
        try:
            result = await wrapper.raw_results_async("test query", time_range=0, site="")
            assert result == mock_response_data["search_result"]
        finally:
            InfoQuestAPIWrapper.raw_results_async = original_method
@pytest.mark.asyncio
async def test_raw_results_async_error(self, wrapper):
if not self._asyncio_available:
pytest.skip("pytest-asyncio is not installed")
original_method = InfoQuestAPIWrapper.raw_results_async
async def mock_raw_results_async_error(self, query, time_range=0, site="", output_format="json"):
raise Exception("Error 400: Bad Request")
InfoQuestAPIWrapper.raw_results_async = mock_raw_results_async_error
try:
with pytest.raises(Exception, match="Error 400: Bad Request"):
await wrapper.raw_results_async("test query", time_range=0, site="")
finally:
InfoQuestAPIWrapper.raw_results_async = original_method
def test_clean_results_with_images(self, wrapper, mock_response_data):
# Test result cleaning functionality
raw_results = mock_response_data["search_result"]["results"]
cleaned_results = wrapper.clean_results_with_images(raw_results)
assert len(cleaned_results) == 3
# Test page result
page_result = cleaned_results[0]
assert page_result["type"] == "page"
assert page_result["title"] == "Test Title"
assert page_result["url"] == "https://example.com"
assert page_result["desc"] == "Test description"
# Test news result
news_result = cleaned_results[1]
assert news_result["type"] == "news"
assert news_result["time_frame"] == "2 days ago"
assert news_result["title"] == "Test News"
assert news_result["url"] == "https://example.com/news"
assert news_result["source"] == "Test Source"
# Test image result
image_result = cleaned_results[2]
assert image_result["type"] == "image_url"
assert image_result["image_url"] == "https://example.com/image.jpg"
assert image_result["image_description"] == "Test image description"
def test_clean_results_empty_categories(self, wrapper):
# Test result cleaning with empty categories
data = [
{
"content": {
"results": {
"organic": [],
"top_stories": {"items": []},
"images": {"items": []}
}
}
}
]
result = wrapper.clean_results_with_images(data)
assert len(result) == 0
def test_clean_results_url_deduplication(self, wrapper):
# Test URL deduplication functionality
data = [
{
"content": {
"results": {
"organic": [
{
"title": "Test Title 1",
"url": "https://example.com",
"desc": "Description 1"
},
{
"title": "Test Title 2",
"url": "https://example.com",
"desc": "Description 2"
}
]
}
}
}
]
result = wrapper.clean_results_with_images(data)
assert len(result) == 1
assert result[0]["title"] == "Test Title 1"
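The cleaning tests above imply a specific flattening: `organic` entries become `"page"` results, `top_stories` items become `"news"`, `images` items become `"image_url"`, and duplicate URLs keep only the first hit. A sketch that reproduces that expected shape (it mirrors the test fixtures, not the real `InfoQuestAPIWrapper.clean_results_with_images` implementation):

```python
def clean_results_with_images(raw_results):
    """Flatten InfoQuest search results into typed dicts, keeping the
    first occurrence of each page/news URL."""
    cleaned, seen_urls = [], set()
    for entry in raw_results:
        results = entry.get("content", {}).get("results", {})
        for item in results.get("organic", []):
            if item["url"] in seen_urls:
                continue  # drop duplicate URLs, keep the first
            seen_urls.add(item["url"])
            cleaned.append({"type": "page", "title": item["title"],
                            "url": item["url"], "desc": item["desc"]})
        for item in results.get("top_stories", {}).get("items", []):
            if item["url"] in seen_urls:
                continue
            seen_urls.add(item["url"])
            cleaned.append({"type": "news",
                            "time_frame": item["time_frame"],
                            "title": item["title"], "url": item["url"],
                            "source": item["source"]})
        for item in results.get("images", {}).get("items", []):
            cleaned.append({"type": "image_url",
                            "image_url": item["url"],
                            "image_description": item["alt"]})
    return cleaned
```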

View File

@@ -0,0 +1,226 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT
import json
from unittest.mock import Mock, patch
import pytest
class TestInfoQuestSearchResults:
@pytest.fixture
def search_tool(self):
"""Create a mock InfoQuestSearchResults instance."""
mock_tool = Mock()
mock_tool.time_range = 30
mock_tool.site = "example.com"
def mock_run(query, **kwargs):
sample_cleaned_results = [
{
"type": "page",
"title": "Test Title",
"url": "https://example.com",
"desc": "Test description"
}
]
sample_raw_results = {
"results": [
{
"content": {
"results": {
"organic": [
{
"title": "Test Title",
"url": "https://example.com",
"desc": "Test description"
}
]
}
}
}
]
}
return json.dumps(sample_cleaned_results, ensure_ascii=False), sample_raw_results
async def mock_arun(query, **kwargs):
return mock_run(query, **kwargs)
mock_tool._run = mock_run
mock_tool._arun = mock_arun
return mock_tool
@pytest.fixture
def sample_raw_results(self):
"""Sample raw results from InfoQuest API."""
return {
"results": [
{
"content": {
"results": {
"organic": [
{
"title": "Test Title",
"url": "https://example.com",
"desc": "Test description"
}
]
}
}
}
]
}
@pytest.fixture
def sample_cleaned_results(self):
"""Sample cleaned results."""
return [
{
"type": "page",
"title": "Test Title",
"url": "https://example.com",
"desc": "Test description"
}
]
def test_init_default_values(self):
"""Test initialization with default values using patch."""
with patch('src.tools.infoquest_search.infoquest_search_results.InfoQuestAPIWrapper') as mock_wrapper_class:
mock_instance = Mock()
mock_wrapper_class.return_value = mock_instance
from src.tools.infoquest_search.infoquest_search_results import InfoQuestSearchResults
with patch.object(InfoQuestSearchResults, '__init__', return_value=None) as mock_init:
InfoQuestSearchResults(infoquest_api_key="dummy-key")
mock_init.assert_called_once()
def test_init_custom_values(self):
"""Test initialization with custom values using patch."""
with patch('src.tools.infoquest_search.infoquest_search_results.InfoQuestAPIWrapper') as mock_wrapper_class:
mock_instance = Mock()
mock_wrapper_class.return_value = mock_instance
from src.tools.infoquest_search.infoquest_search_results import InfoQuestSearchResults
with patch.object(InfoQuestSearchResults, '__init__', return_value=None) as mock_init:
InfoQuestSearchResults(
time_range=10,
site="test.com",
infoquest_api_key="dummy-key"
)
mock_init.assert_called_once()
def test_run_success(
self,
search_tool,
sample_raw_results,
sample_cleaned_results,
):
"""Test successful synchronous run."""
result, raw = search_tool._run("test query")
assert isinstance(result, str)
assert isinstance(raw, dict)
assert "results" in raw
result_data = json.loads(result)
assert isinstance(result_data, list)
assert len(result_data) > 0
def test_run_exception(self, search_tool):
"""Test synchronous run with exception."""
original_run = search_tool._run
def mock_run_with_error(query, **kwargs):
return json.dumps({"error": "API Error"}, ensure_ascii=False), {}
try:
search_tool._run = mock_run_with_error
result, raw = search_tool._run("test query")
result_dict = json.loads(result)
assert "error" in result_dict
assert "API Error" in result_dict["error"]
assert raw == {}
finally:
search_tool._run = original_run
@pytest.mark.asyncio
async def test_arun_success(
self,
search_tool,
sample_raw_results,
sample_cleaned_results,
):
"""Test successful asynchronous run."""
result, raw = await search_tool._arun("test query")
assert isinstance(result, str)
assert isinstance(raw, dict)
assert "results" in raw
@pytest.mark.asyncio
async def test_arun_exception(self, search_tool):
"""Test asynchronous run with exception."""
original_arun = search_tool._arun
async def mock_arun_with_error(query, **kwargs):
return json.dumps({"error": "Async API Error"}, ensure_ascii=False), {}
try:
search_tool._arun = mock_arun_with_error
result, raw = await search_tool._arun("test query")
result_dict = json.loads(result)
assert "error" in result_dict
assert "Async API Error" in result_dict["error"]
assert raw == {}
finally:
search_tool._arun = original_arun
def test_run_with_run_manager(
self,
search_tool,
sample_raw_results,
sample_cleaned_results,
):
"""Test run with callback manager."""
mock_run_manager = Mock()
result, raw = search_tool._run("test query", run_manager=mock_run_manager)
assert isinstance(result, str)
assert isinstance(raw, dict)
@pytest.mark.asyncio
async def test_arun_with_run_manager(
self,
search_tool,
sample_raw_results,
sample_cleaned_results,
):
"""Test async run with callback manager."""
mock_run_manager = Mock()
result, raw = await search_tool._arun("test query", run_manager=mock_run_manager)
assert isinstance(result, str)
assert isinstance(raw, dict)
def test_api_wrapper_initialization_with_key(self):
"""Test API wrapper initialization with key."""
with patch('src.tools.infoquest_search.infoquest_search_results.InfoQuestAPIWrapper') as mock_wrapper_class:
mock_instance = Mock()
mock_wrapper_class.return_value = mock_instance
from src.tools.infoquest_search.infoquest_search_results import InfoQuestSearchResults
with patch.object(InfoQuestSearchResults, '__init__', return_value=None) as mock_init:
InfoQuestSearchResults(infoquest_api_key="test-key")
mock_init.assert_called_once()