Mirror of https://gitee.com/wanwujie/deer-flow, synced 2026-04-03 06:12:14 +08:00

feat: support infoquest (#708)

* support infoquest
* support html checker
* change line break format
* Fix several critical issues in the codebase
  - Resolve crawler panic by improving error handling
  - Fix plan validation to prevent invalid configurations
  - Correct InfoQuest crawler JSON conversion logic
* add test for infoquest
* Add InfoQuest introduction to the README
* fix readme for infoquest
* resolve the conflict
* Fix formatting of INFOQUEST in SearchEngine enum
* Apply suggestions from code review

Co-authored-by: Willem Jiang <143703838+willem-bd@users.noreply.github.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Committed by: GitHub
Parent: e179fb1632
Commit: 7ec9e45702
@@ -24,9 +24,10 @@ ENABLE_MCP_SERVER_CONFIGURATION=false
 # Otherwise, your system could be compromised.
 ENABLE_PYTHON_REPL=false
 
-# Search Engine, Supported values: tavily (recommended), duckduckgo, brave_search, arxiv, searx
+# Search Engine, Supported values: tavily, infoquest (recommended), duckduckgo, brave_search, arxiv, searx
 SEARCH_API=tavily
 TAVILY_API_KEY=tvly-xxx
+INFOQUEST_API_KEY="infoquest-xxx"
 # SEARX_HOST=xxx # Required only if SEARCH_API is searx (compatible with both Searx and SearxNG)
 # BRAVE_SEARCH_API_KEY=xxx # Required only if SEARCH_API is brave_search
 # JINA_API_KEY=jina_xxx # Optional, default is None
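The `SEARCH_API` variable above drives engine selection at startup. A minimal sketch of that lookup, assuming env vars are already loaded (the helper name here is illustrative, not the repo's API; the fallback mirrors the `SEARCH_API=tavily` default):

```python
import os

def selected_search_engine(default: str = "tavily") -> str:
    """Return the configured search engine, falling back to Tavily.

    Mirrors the SEARCH_API variable from .env; the value is normalized
    so "Tavily" and "tavily " both resolve to "tavily".
    """
    value = os.getenv("SEARCH_API", default)
    return value.strip().lower()

os.environ["SEARCH_API"] = "infoquest"
print(selected_search_engine())  # infoquest
```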
README.md (37 lines changed)
@@ -14,6 +14,7 @@
 Currently, DeerFlow has officially entered the [FaaS Application Center of Volcengine](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market). Users can experience it online through the [experience link](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market/deerflow/?channel=github&source=deerflow) to intuitively feel its powerful functions and convenient operations. At the same time, to meet the deployment needs of different users, DeerFlow supports one-click deployment based on Volcengine. Click the [deployment link](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/application/create?templateId=683adf9e372daa0008aaed5c&channel=github&source=deerflow) to quickly complete the deployment process and start an efficient research journey.
 
+DeerFlow has newly integrated [InfoQuest (free online experience supported)](https://console.byteplus.com/infoquest/infoquests), an intelligent search and crawling toolset independently developed by BytePlus.
 
 Please visit [our official website](https://deerflow.tech/) for more details.
 
@@ -159,6 +160,13 @@ DeerFlow supports multiple search engines that can be configured in your `.env`
   - Requires `TAVILY_API_KEY` in your `.env` file
   - Sign up at: https://app.tavily.com/home
 
+- **InfoQuest** (recommended): AI-optimized intelligent search and crawling toolset independently developed by BytePlus
+  - Requires `INFOQUEST_API_KEY` in your `.env` file
+  - Supports time range filtering and site filtering
+  - Provides high-quality search results and content extraction
+  - Sign up at: https://console.byteplus.com/infoquest/infoquests
+  - Visit https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest to learn more
+
 - **DuckDuckGo**: Privacy-focused search engine
   - No API key required
@@ -177,10 +185,31 @@ DeerFlow supports multiple search engines that can be configured in your `.env`
 To configure your preferred search engine, set the `SEARCH_API` variable in your `.env` file:
 
 ```bash
-# Choose one: tavily, duckduckgo, brave_search, arxiv
+# Choose one: tavily, infoquest, duckduckgo, brave_search, arxiv
 SEARCH_API=tavily
 ```
+
+### Crawling Tools
+
+DeerFlow supports multiple crawling tools that can be configured in your `conf.yaml` file:
+
+- **Jina** (default): Freely accessible web content crawling tool
+
+- **InfoQuest** (recommended): AI-optimized intelligent search and crawling toolset developed by BytePlus
+  - Requires `INFOQUEST_API_KEY` in your `.env` file
+  - Provides configurable crawling parameters
+  - Supports custom timeout settings
+  - Offers more powerful content extraction capabilities
+  - Visit https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest to learn more
+
+To configure your preferred crawling tool, set the following in your `conf.yaml` file:
+
+```yaml
+CRAWLER_ENGINE:
+  # Engine type: "jina" (default) or "infoquest"
+  engine: infoquest
+```
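The `CRAWLER_ENGINE` block above is read when a page is crawled. A minimal sketch of how a parsed config like this might select an engine (illustrative only; the helper name and the fallback-to-Jina behavior are assumptions, not the repo's actual API):

```python
def select_crawler_engine(config: dict) -> str:
    """Pick the crawler engine from a parsed conf.yaml, defaulting to Jina.

    `config` is the full parsed YAML mapping; unknown engine names fall
    back to "jina" rather than raising, a conservative illustrative choice.
    """
    crawler_config = config.get("CRAWLER_ENGINE") or {}
    engine = str(crawler_config.get("engine", "jina")).strip().lower()
    return engine if engine in {"jina", "infoquest"} else "jina"

# A dict standing in for the parsed conf.yaml shown above.
parsed = {"CRAWLER_ENGINE": {"engine": "infoquest"}}
print(select_crawler_engine(parsed))  # infoquest
print(select_crawler_engine({}))      # jina
```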
 ### Private Knowledgebase
 
 DeerFlow supports private knowledgebases such as RAGFlow, Qdrant, Milvus, and VikingDB, so that you can use your private documents to answer questions.
 
@@ -221,8 +250,8 @@ DeerFlow supports private knowledgebase such as RAGFlow, Qdrant, Milvus, and Vik
 ### Tools and MCP Integrations
 
 - 🔍 **Search and Retrieval**
-  - Web search via Tavily, Brave Search and more
-  - Crawling with Jina
+  - Web search via Tavily, InfoQuest, Brave Search and more
+  - Crawling with Jina and InfoQuest
   - Advanced content extraction
   - Support for private knowledgebase
 
@@ -284,7 +313,6 @@ The system employs a streamlined workflow with the following components:
 - Manages the research flow and decides when to generate the final report
 
 3. **Research Team**: A collection of specialized agents that execute the plan:
 
 - **Researcher**: Conducts web searches and information gathering using tools like web search engines, crawling, and even MCP services.
 - **Coder**: Handles code analysis, execution, and technical tasks using the Python REPL tool.
 Each agent has access to specific tools optimized for their role and operates within the LangGraph framework
 
@@ -475,7 +503,6 @@ docker build -t deer-flow-api .
 ```
 
 Finally, start up a docker container running the web server:
 
 ```bash
 # Replace deer-flow-api-app with your preferred container name
 # Start the server then bind to localhost:8000
README_de.md (38 lines changed)
@@ -13,6 +13,8 @@
 Derzeit ist DeerFlow offiziell in das [FaaS-Anwendungszentrum von Volcengine](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market) eingezogen. Benutzer können es über den [Erfahrungslink](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market/deerflow/?channel=github&source=deerflow) online erleben, um seine leistungsstarken Funktionen und bequemen Operationen intuitiv zu spüren. Gleichzeitig unterstützt DeerFlow zur Erfüllung der Bereitstellungsanforderungen verschiedener Benutzer die Ein-Klick-Bereitstellung basierend auf Volcengine. Klicken Sie auf den [Bereitstellungslink](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/application/create?templateId=683adf9e372daa0008aaed5c&channel=github&source=deerflow), um den Bereitstellungsprozess schnell abzuschließen und eine effiziente Forschungsreise zu beginnen.
 
+DeerFlow hat neu das intelligente Such- und Crawling-Toolset von BytePlus integriert: [InfoQuest (unterstützt kostenlose Online-Erfahrung)](https://console.byteplus.com/infoquest/infoquests)
+
 Besuchen Sie [unsere offizielle Website](https://deerflow.tech/) für weitere Details.
 
 ## Demo
 
@@ -156,6 +158,13 @@ DeerFlow unterstützt mehrere Suchmaschinen, die in Ihrer `.env`-Datei über die
   - Erfordert `TAVILY_API_KEY` in Ihrer `.env`-Datei
   - Registrieren Sie sich unter: https://app.tavily.com/home
 
+- **InfoQuest** (empfohlen): Ein KI-optimiertes intelligentes Such- und Crawling-Toolset, entwickelt von BytePlus
+  - Erfordert `INFOQUEST_API_KEY` in Ihrer `.env`-Datei
+  - Unterstützung für Zeitbereichsfilterung und Seitenfilterung
+  - Bietet qualitativ hochwertige Suchergebnisse und Inhaltsextraktion
+  - Registrieren Sie sich unter: https://console.byteplus.com/infoquest/infoquests
+  - Besuchen Sie https://docs.byteplus.com/de/docs/InfoQuest/What_is_Info_Quest für weitere Informationen
+
 - **DuckDuckGo**: Datenschutzorientierte Suchmaschine
   - Kein API-Schlüssel erforderlich
 
@@ -174,10 +183,32 @@ DeerFlow unterstützt mehrere Suchmaschinen, die in Ihrer `.env`-Datei über die
 Um Ihre bevorzugte Suchmaschine zu konfigurieren, setzen Sie die Variable `SEARCH_API` in Ihrer `.env`-Datei:
 
 ```bash
-# Wählen Sie eine: tavily, duckduckgo, brave_search, arxiv
+# Wählen Sie eine: tavily, infoquest, duckduckgo, brave_search, arxiv
 SEARCH_API=tavily
 ```
+
+### Crawling-Tools
+
+- **Jina** (Standard): Kostenloses, zugängliches Webinhalts-Crawling-Tool
+  - Kein API-Schlüssel erforderlich für grundlegende Funktionen
+  - Mit API-Schlüssel erhalten Sie höhere Zugriffsraten
+  - Weitere Informationen unter <https://jina.ai/reader>
+
+- **InfoQuest** (empfohlen): KI-optimiertes intelligentes Such- und Crawling-Toolset, entwickelt von BytePlus
+  - Erfordert `INFOQUEST_API_KEY` in Ihrer `.env`-Datei
+  - Bietet konfigurierbare Crawling-Parameter
+  - Unterstützt benutzerdefinierte Timeout-Einstellungen
+  - Bietet stärkere Inhaltsextraktionsfähigkeiten
+  - Weitere Informationen unter <https://docs.byteplus.com/de/docs/InfoQuest/What_is_Info_Quest>
+
+Um Ihr bevorzugtes Crawling-Tool zu konfigurieren, setzen Sie Folgendes in Ihrer `conf.yaml`-Datei:
+
+```yaml
+CRAWLER_ENGINE:
+  # Engine-Typ: "jina" (Standard) oder "infoquest"
+  engine: infoquest
+```
 
 ### Private Wissensbasis
 
 DeerFlow unterstützt private Wissensbasen wie RAGFlow und VikingDB, sodass Sie Ihre privaten Dokumente zur Beantwortung von Fragen verwenden können.
 
@@ -205,8 +236,8 @@ DeerFlow unterstützt private Wissensbasen wie RAGFlow und VikingDB, sodass Sie
 ### Tools und MCP-Integrationen
 
 - 🔍 **Suche und Abruf**
-  - Websuche über Tavily, Brave Search und mehr
-  - Crawling mit Jina
+  - Websuche über Tavily, InfoQuest, Brave Search und mehr
+  - Crawling mit Jina und InfoQuest
   - Fortgeschrittene Inhaltsextraktion
   - Unterstützung für private Wissensbasis
 
@@ -505,7 +536,6 @@ Die Anwendung unterstützt jetzt einen interaktiven Modus mit eingebauten Fragen
 4. Das System wird Ihre Frage verarbeiten und einen umfassenden Forschungsbericht generieren
 
 ### Mensch-in-der-Schleife
 
 DeerFlow enthält einen Mensch-in-der-Schleife-Mechanismus, der es Ihnen ermöglicht, Forschungspläne vor ihrer Ausführung zu überprüfen, zu bearbeiten und zu genehmigen:
 
 1. **Planüberprüfung**: Wenn Mensch-in-der-Schleife aktiviert ist, präsentiert das System den generierten Forschungsplan zur Überprüfung vor der Ausführung
README_es.md (37 lines changed)
@@ -13,6 +13,8 @@
 Actualmente, DeerFlow ha ingresado oficialmente al Centro de Aplicaciones FaaS de Volcengine. Los usuarios pueden experimentarlo en línea a través del enlace de experiencia para sentir intuitivamente sus potentes funciones y operaciones convenientes. Al mismo tiempo, para satisfacer las necesidades de implementación de diferentes usuarios, DeerFlow admite la implementación con un clic basada en Volcengine. Haga clic en el enlace de implementación para completar rápidamente el proceso de implementación y comenzar un viaje de investigación eficiente.
 
+DeerFlow ha integrado recientemente el conjunto de herramientas de búsqueda y rastreo inteligente desarrollado independientemente por BytePlus: [InfoQuest (admite experiencia gratuita en línea)](https://console.byteplus.com/infoquest/infoquests)
+
 Por favor, visita [nuestra página web oficial](https://deerflow.tech/) para más detalles.
 
 ## Demostración
 
@@ -155,6 +157,13 @@ DeerFlow soporta múltiples motores de búsqueda que pueden configurarse en tu a
   - Requiere `TAVILY_API_KEY` en tu archivo `.env`
   - Regístrate en: <https://app.tavily.com/home>
 
+- **InfoQuest** (recomendado): Un conjunto de herramientas inteligentes de búsqueda y rastreo optimizadas para IA, desarrollado por BytePlus
+  - Requiere `INFOQUEST_API_KEY` en tu archivo `.env`
+  - Soporte para filtrado por rango de fecha y filtrado de sitios web
+  - Proporciona resultados de búsqueda y extracción de contenido de alta calidad
+  - Regístrate en: <https://console.byteplus.com/infoquest/infoquests>
+  - Visita <https://docs.byteplus.com/es/docs/InfoQuest/What_is_Info_Quest> para obtener más información
+
 - **DuckDuckGo**: Motor de búsqueda centrado en la privacidad
   - No requiere clave API
 
@@ -175,10 +184,32 @@ DeerFlow soporta múltiples motores de búsqueda que pueden configurarse en tu a
 Para configurar tu motor de búsqueda preferido, establece la variable `SEARCH_API` en tu archivo `.env`:
 
 ```bash
-# Elige uno: tavily, duckduckgo, brave_search, arxiv
+# Elige uno: tavily, infoquest, duckduckgo, brave_search, arxiv
 SEARCH_API=tavily
 ```
+
+### Herramientas de Rastreo
+
+- **Jina** (predeterminado): Herramienta gratuita de rastreo de contenido web accesible
+  - No se requiere clave API para usar funciones básicas
+  - Al usar una clave API, se obtienen límites de tasa de acceso más altos
+  - Visite <https://jina.ai/reader> para obtener más información
+
+- **InfoQuest** (recomendado): Conjunto de herramientas inteligentes de búsqueda y rastreo optimizadas para IA, desarrollado por BytePlus
+  - Requiere `INFOQUEST_API_KEY` en tu archivo `.env`
+  - Proporciona parámetros de rastreo configurables
+  - Admite configuración de tiempo de espera personalizada
+  - Ofrece capacidades más potentes de extracción de contenido
+  - Visita <https://docs.byteplus.com/es/docs/InfoQuest/What_is_Info_Quest> para obtener más información
+
+Para configurar su herramienta de rastreo preferida, establezca lo siguiente en su archivo `conf.yaml`:
+
+```yaml
+CRAWLER_ENGINE:
+  # Tipo de motor: "jina" (predeterminado) o "infoquest"
+  engine: infoquest
+```
 
 ## Características
 
 ### Capacidades Principales
 
@@ -193,8 +224,8 @@ SEARCH_API=tavily
 - 🔍 **Búsqueda y Recuperación**
-  - Búsqueda web a través de Tavily, Brave Search y más
-  - Rastreo con Jina
+  - Búsqueda web a través de Tavily, InfoQuest, Brave Search y más
+  - Rastreo con Jina e InfoQuest
   - Extracción avanzada de contenido
 
 - 🔗 **Integración Perfecta con MCP**
README_ja.md (37 lines changed)
@@ -11,6 +11,8 @@
 現在、DeerFlow は火山引擎の FaaS アプリケーションセンターに正式に入居しています。ユーザーは体験リンクを通じてオンラインで体験し、その強力な機能と便利な操作を直感的に感じることができます。同時に、さまざまなユーザーの展開ニーズを満たすため、DeerFlow は火山引擎に基づくワンクリック展開をサポートしています。展開リンクをクリックして展開プロセスを迅速に完了し、効率的な研究の旅を始めましょう。
 
+DeerFlow は新たに BytePlus が自主開発したインテリジェント検索・クローリングツールセットを統合しました:[InfoQuest(オンライン無料体験をサポート)](https://console.byteplus.com/infoquest/infoquests)
+
 詳細については[DeerFlow の公式ウェブサイト](https://deerflow.tech/)をご覧ください。
 
 ## デモ
 
@@ -151,6 +153,13 @@ DeerFlow は複数の検索エンジンをサポートしており、`.env`フ
   - `.env`ファイルに`TAVILY_API_KEY`が必要
   - 登録先:<https://app.tavily.com/home>
 
+- **InfoQuest**(推奨):BytePlusが開発したAI最適化のインテリジェント検索とクローリングツールセット
+  - `.env`ファイルに`INFOQUEST_API_KEY`が必要
+  - 時間範囲フィルタリングとサイトフィルタリングをサポート
+  - 高品質な検索結果とコンテンツ抽出を提供
+  - 登録先:<https://console.byteplus.com/infoquest/infoquests>
+  - ドキュメント:<https://docs.byteplus.com/ja/docs/InfoQuest/What_is_Info_Quest>
+
 - **DuckDuckGo**:プライバシー重視の検索エンジン
   - APIキー不要
 
@@ -169,10 +178,32 @@ DeerFlow は複数の検索エンジンをサポートしており、`.env`フ
 お好みの検索エンジンを設定するには、`.env`ファイルで`SEARCH_API`変数を設定します:
 
 ```bash
-# 選択肢: tavily, duckduckgo, brave_search, arxiv
+# 選択肢: tavily, infoquest, duckduckgo, brave_search, arxiv
 SEARCH_API=tavily
 ```
+
+### クローリングツール
+
+- **Jina**(デフォルト):無料でアクセス可能なウェブコンテンツクローリングツール
+  - 基本機能を使用するにはAPIキーは不要
+  - APIキーを使用するとより高いアクセスレート制限が適用されます
+  - 詳細については <https://jina.ai/reader> を参照してください
+
+- **InfoQuest**(推奨):BytePlusが開発したAI最適化のインテリジェント検索とクローリングツールセット
+  - `.env`ファイルに`INFOQUEST_API_KEY`が必要
+  - 設定可能なクローリングパラメータを提供
+  - カスタムタイムアウト設定をサポート
+  - より強力なコンテンツ抽出機能を提供
+  - 詳細については <https://docs.byteplus.com/ja/docs/InfoQuest/What_is_Info_Quest> を参照してください
+
+お好みのクローリングツールを設定するには、`conf.yaml`ファイルで以下を設定します:
+
+```yaml
+CRAWLER_ENGINE:
+  # エンジンタイプ:"jina"(デフォルト)または "infoquest"
+  engine: infoquest
+```
 
 ## 特徴
 
 ### コア機能
 
@@ -186,8 +217,8 @@ SEARCH_API=tavily
 ### ツールと MCP 統合
 
 - 🔍 **検索と取得**
-  - Tavily、Brave Searchなどを通じたWeb検索
-  - Jinaを使用したクローリング
+  - Tavily、InfoQuest、Brave Searchなどを通じたWeb検索
+  - JinaとInfoQuestを使用したクローリング
   - 高度なコンテンツ抽出
 
 - 🔗 **MCPシームレス統合**
README_pt.md (37 lines changed)
@@ -14,6 +14,8 @@
 Atualmente, o DeerFlow entrou oficialmente no Centro de Aplicações FaaS da Volcengine. Os usuários podem experimentá-lo online através do link de experiência para sentir intuitivamente suas funções poderosas e operações convenientes. Ao mesmo tempo, para atender às necessidades de implantação de diferentes usuários, o DeerFlow suporta implantação com um clique baseada na Volcengine. Clique no link de implantação para completar rapidamente o processo de implantação e iniciar uma jornada de pesquisa eficiente.
 
+O DeerFlow recentemente integrou o conjunto de ferramentas de busca e rastreamento inteligente desenvolvido independentemente pela BytePlus: [InfoQuest (oferece experiência gratuita online)](https://console.byteplus.com/infoquest/infoquests)
+
 Por favor, visite [Nosso Site Oficial](https://deerflow.tech/) para maiores detalhes.
 
 ## Demo
 
@@ -158,6 +160,13 @@ DeerFlow suporta múltiplos mecanismos de busca que podem ser configurados no se
   - Requer `TAVILY_API_KEY` no seu arquivo `.env`
   - Inscreva-se em: <https://app.tavily.com/home>
 
+- **InfoQuest** (recomendado): Um conjunto de ferramentas inteligentes de busca e crawling otimizadas para IA, desenvolvido pela BytePlus
+  - Requer `INFOQUEST_API_KEY` no seu arquivo `.env`
+  - Suporte para filtragem por intervalo de tempo e filtragem de sites
+  - Fornece resultados de busca e extração de conteúdo de alta qualidade
+  - Inscreva-se em: <https://console.byteplus.com/infoquest/infoquests>
+  - Visite <https://docs.byteplus.com/pt/docs/InfoQuest/What_is_Info_Quest> para obter mais informações
+
 - **DuckDuckGo**: Mecanismo de busca focado em privacidade
   - Não requer chave API
 
@@ -178,10 +187,32 @@ DeerFlow suporta múltiplos mecanismos de busca que podem ser configurados no se
 Para configurar o seu mecanismo preferido, defina a variável `SEARCH_API` no seu arquivo `.env`:
 
 ```bash
-# Escolha uma: tavily, duckduckgo, brave_search, arxiv
+# Escolha uma: tavily, infoquest, duckduckgo, brave_search, arxiv
 SEARCH_API=tavily
 ```
+
+### Ferramentas de Crawling
+
+- **Jina** (padrão): Ferramenta gratuita e acessível de crawling de conteúdo web
+  - Não é necessária chave API para usar recursos básicos
+  - Ao usar uma chave API, você obtém limites de taxa de acesso mais altos
+  - Visite <https://jina.ai/reader> para obter mais informações
+
+- **InfoQuest** (recomendado): Conjunto de ferramentas inteligentes de busca e crawling otimizadas para IA, desenvolvido pela BytePlus
+  - Requer `INFOQUEST_API_KEY` no seu arquivo `.env`
+  - Fornece parâmetros de crawling configuráveis
+  - Suporta configurações de timeout personalizadas
+  - Oferece capacidades mais poderosas de extração de conteúdo
+  - Visite <https://docs.byteplus.com/pt/docs/InfoQuest/What_is_Info_Quest> para obter mais informações
+
+Para configurar sua ferramenta de crawling preferida, defina o seguinte em seu arquivo `conf.yaml`:
+
+```yaml
+CRAWLER_ENGINE:
+  # Tipo de mecanismo: "jina" (padrão) ou "infoquest"
+  engine: infoquest
+```
 
 ## Funcionalidades
 
 ### Principais Funcionalidades
 
@@ -197,8 +228,8 @@ SEARCH_API=tavily
 - 🔍 **Busca e Recuperação**
-  - Busca web com Tavily, Brave Search e mais
-  - Crawling com Jina
+  - Busca web com Tavily, InfoQuest, Brave Search e mais
+  - Crawling com Jina e InfoQuest
   - Extração de Conteúdo avançada
 
 - 🔗 **Integração MCP perfeita**
README_ru.md (37 lines changed)
@@ -13,6 +13,8 @@
 В настоящее время DeerFlow официально вошел в Центр приложений FaaS Volcengine. Пользователи могут испытать его онлайн через ссылку для опыта, чтобы интуитивно почувствовать его мощные функции и удобные операции. В то же время, для удовлетворения потребностей развертывания различных пользователей, DeerFlow поддерживает развертывание одним кликом на основе Volcengine. Нажмите на ссылку развертывания, чтобы быстро завершить процесс развертывания и начать эффективное исследовательское путешествие.
 
+DeerFlow недавно интегрировал интеллектуальный набор инструментов поиска и краулинга, разработанный самостоятельно компанией BytePlus: [InfoQuest (поддерживает бесплатное онлайн-опробование)](https://console.byteplus.com/infoquest/infoquests)
+
 Пожалуйста, посетите [наш официальный сайт](https://deerflow.tech/) для получения дополнительной информации.
 
 ## Демонстрация
 
@@ -155,6 +157,13 @@ DeerFlow поддерживает несколько поисковых сист
   - Требуется `TAVILY_API_KEY` в вашем файле `.env`
   - Зарегистрируйтесь на: <https://app.tavily.com/home>
 
+- **InfoQuest** (рекомендуется): Набор интеллектуальных инструментов для поиска и сканирования, оптимизированных для ИИ, разработанный компанией BytePlus
+  - Требуется `INFOQUEST_API_KEY` в вашем файле `.env`
+  - Поддержка фильтрации по диапазону времени и фильтрации сайтов
+  - Предоставляет высококачественные результаты поиска и извлечение контента
+  - Зарегистрируйтесь на: <https://console.byteplus.com/infoquest/infoquests>
+  - Посетите <https://docs.byteplus.com/ru/docs/InfoQuest/What_is_Info_Quest> для получения дополнительной информации
+
 - **DuckDuckGo**: Поисковая система, ориентированная на конфиденциальность
   - Не требуется API-ключ
 
@@ -175,10 +184,32 @@ DeerFlow поддерживает несколько поисковых сист
 Чтобы настроить предпочитаемую поисковую систему, установите переменную `SEARCH_API` в вашем файле `.env`:
 
 ```bash
-# Выберите одно: tavily, duckduckgo, brave_search, arxiv
+# Выберите одно: tavily, infoquest, duckduckgo, brave_search, arxiv
 SEARCH_API=tavily
 ```
+
+### Инструменты сканирования
+
+- **Jina** (по умолчанию): Бесплатный доступный инструмент для сканирования веб-контента
+  - API-ключ не требуется для использования базовых функций
+  - При использовании API-ключа вы получаете более высокие лимиты скорости доступа
+  - Посетите <https://jina.ai/reader> для получения дополнительной информации
+
+- **InfoQuest** (рекомендуется): Набор интеллектуальных инструментов для поиска и сканирования, оптимизированных для ИИ, разработанный компанией BytePlus
+  - Требуется `INFOQUEST_API_KEY` в вашем файле `.env`
+  - Предоставляет настраиваемые параметры сканирования
+  - Поддерживает настройки пользовательских тайм-аутов
+  - Предоставляет более мощные возможности извлечения контента
+  - Посетите <https://docs.byteplus.com/ru/docs/InfoQuest/What_is_Info_Quest> для получения дополнительной информации
+
+Чтобы настроить предпочитаемый инструмент сканирования, установите следующее в вашем файле `conf.yaml`:
+
+```yaml
+CRAWLER_ENGINE:
+  # Тип движка: "jina" (по умолчанию) или "infoquest"
+  engine: infoquest
+```
 
 ## Особенности
 
 ### Ключевые возможности
 
@@ -193,8 +224,8 @@ SEARCH_API=tavily
 - 🔍 **Поиск и извлечение**
-  - Веб-поиск через Tavily, Brave Search и другие
-  - Сканирование с Jina
+  - Веб-поиск через Tavily, InfoQuest, Brave Search и другие
+  - Сканирование с Jina и InfoQuest
   - Расширенное извлечение контента
 
 - 🔗 **Бесшовная интеграция MCP**
README_zh.md (37 lines changed)
@@ -11,6 +11,8 @@
 目前,DeerFlow 已正式入驻[火山引擎的 FaaS 应用中心](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market),用户可通过[体验链接](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/market/deerflow/?channel=github&source=deerflow)进行在线体验,直观感受其强大功能与便捷操作;同时,为满足不同用户的部署需求,DeerFlow 支持基于火山引擎一键部署,点击[部署链接](https://console.volcengine.com/vefaas/region:vefaas+cn-beijing/application/create?templateId=683adf9e372daa0008aaed5c&channel=github&source=deerflow)即可快速完成部署流程,开启高效研究之旅。
 
+DeerFlow 新接入 BytePlus 自主推出的智能搜索与爬取工具集:[InfoQuest(支持在线免费体验)](https://console.byteplus.com/infoquest/infoquests)
+
 请访问[DeerFlow 的官方网站](https://deerflow.tech/)了解更多详情。
 
 ## 演示
 
@@ -152,6 +154,13 @@ DeerFlow 支持多种搜索引擎,可以在`.env`文件中通过`SEARCH_API`
   - 需要在`.env`文件中设置`TAVILY_API_KEY`
   - 注册地址:<https://app.tavily.com/home>
 
+- **InfoQuest**(推荐):BytePlus 自主研发的专为 AI 应用优化的智能搜索与爬取工具集
+  - 需要在`.env`文件中设置`INFOQUEST_API_KEY`
+  - 支持时间范围过滤和站点过滤
+  - 提供高质量的搜索结果和内容提取
+  - 注册地址:<https://console.byteplus.com/infoquest/infoquests>
+  - 访问 <https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest> 了解更多信息
+
 - **DuckDuckGo**:注重隐私的搜索引擎
   - 无需 API 密钥
 
@@ -170,10 +179,32 @@ DeerFlow 支持多种搜索引擎,可以在`.env`文件中通过`SEARCH_API`
 要配置您首选的搜索引擎,请在`.env`文件中设置`SEARCH_API`变量:
 
 ```bash
-# 选择一个:tavily, duckduckgo, brave_search, arxiv
+# 选择一个:tavily, infoquest, duckduckgo, brave_search, arxiv
 SEARCH_API=tavily
 ```
+
+### 爬取工具
+
+- **Jina**(默认):免费可访问的网页内容爬取工具
+  - 无需 API 密钥即可使用基础功能
+  - 使用 API 密钥可获得更高的访问速率限制
+  - 访问 <https://jina.ai/reader> 了解更多信息
+
+- **InfoQuest**(推荐):BytePlus 自主研发的专为 AI 应用优化的智能搜索与爬取工具集
+  - 需要在`.env`文件中设置`INFOQUEST_API_KEY`
+  - 提供可配置的爬取参数
+  - 支持自定义超时设置
+  - 提供更强大的内容提取能力
+  - 访问 <https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest> 了解更多信息
+
+要配置您首选的爬取工具,请在`conf.yaml`文件中设置:
+
+```yaml
+CRAWLER_ENGINE:
+  # 引擎类型:"jina"(默认)或 "infoquest"
+  engine: infoquest
+```
 
 ### 私域知识库引擎
 
 DeerFlow 支持基于私有域知识的检索,您可以将文档上传到多种私有知识库中,以便在研究过程中使用,当前支持的私域知识库有:
 
@@ -221,8 +252,8 @@ DeerFlow 支持基于私有域知识的检索,您可以将文档上传到多
 ### 工具和 MCP 集成
 
 - 🔍 **搜索和检索**
-  - 通过 Tavily、Brave Search 等进行网络搜索
-  - 使用 Jina 进行爬取
+  - 通过 Tavily、InfoQuest、Brave Search 等进行网络搜索
+  - 使用 Jina、InfoQuest 进行爬取
   - 高级内容提取
   - 支持检索指定私有知识库
@@ -61,9 +61,13 @@ BASIC_MODEL:
 # # When interrupt is triggered, user will be prompted to approve/reject
 # # Approved keywords: "approved", "approve", "yes", "proceed", "continue", "ok", "okay", "accepted", "accept"
 
-# Search engine configuration (Only supports Tavily currently)
+# Search engine configuration
+# Supported engines: tavily, infoquest
 # SEARCH_ENGINE:
-#   engine: tavily
+#   # Engine type to use: "tavily" or "infoquest"
+#   engine: tavily or infoquest
 #
 # # The following parameters are specific to Tavily
 # # Only include results from these domains
 # include_domains:
 #   - example.com
 
@@ -88,3 +92,28 @@
 # min_score_threshold: 0.0
 # # Maximum content length per page
 # max_content_length_per_page: 4000
 #
+# # The following parameters are specific to InfoQuest
+# # Used to limit the scope of search results, only returns content within the specified time range. Set to -1 to disable time filtering
+# time_range: 30
+# # Used to limit the scope of search results, only returns content from specified whitelisted domains. Set to empty string to disable site filtering
+# site: "example.com"
+
+
+# Crawler engine configuration
+# Supported engines: jina (default), infoquest
+# Uncomment the following section to configure crawler engine
+# CRAWLER_ENGINE:
+#   # Engine type to use: "jina" (default) or "infoquest"
+#   engine: infoquest
+#
+#   # The following timeout parameters are only effective when engine is set to "infoquest"
+#   # Waiting time after page loading (in seconds)
+#   # Set to positive value to enable, -1 to disable
+#   fetch_time: 10
+#   # Overall timeout for the entire crawling process (in seconds)
+#   # Set to positive value to enable, -1 to disable
+#   timeout: 30
+#   # Timeout for navigating to the page (in seconds)
+#   # Set to positive value to enable, -1 to disable
+#   navi_timeout: 15
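The timeout comments above follow one convention: a positive value enables the setting, -1 disables it. A small sketch of how a client might collect only the enabled settings (the helper is illustrative; the real InfoQuest client's parameter handling may differ):

```python
def infoquest_timeouts(crawler_config: dict) -> dict:
    """Collect InfoQuest timeout settings from a CRAWLER_ENGINE config dict.

    Per the config comments: positive values enable a setting, -1 (the
    default) disables it. Disabled settings are omitted entirely so a
    client can apply its own defaults.
    """
    timeouts = {}
    for key in ("fetch_time", "timeout", "navi_timeout"):
        value = crawler_config.get(key, -1)
        if isinstance(value, (int, float)) and value > 0:
            timeouts[key] = value
    return timeouts

cfg = {"engine": "infoquest", "fetch_time": 10, "timeout": -1}
print(infoquest_timeouts(cfg))  # {'fetch_time': 10}
```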
@@ -11,6 +11,7 @@ load_dotenv()
 class SearchEngine(enum.Enum):
     TAVILY = "tavily"
+    INFOQUEST = "infoquest"
     DUCKDUCKGO = "duckduckgo"
     BRAVE_SEARCH = "brave_search"
     ARXIV = "arxiv"
@@ -18,10 +19,14 @@ class SearchEngine(enum.Enum):
     WIKIPEDIA = "wikipedia"
 
 
+class CrawlerEngine(enum.Enum):
+    JINA = "jina"
+    INFOQUEST = "infoquest"
+
+
 # Tool configuration
 SELECTED_SEARCH_ENGINE = os.getenv("SEARCH_API", SearchEngine.TAVILY.value)
 
 
 class RAGProvider(enum.Enum):
     DIFY = "dify"
     RAGFLOW = "ragflow"
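With `INFOQUEST` added to the enum, a configured `SEARCH_API` string can be validated by enum value lookup. A self-contained sketch (the enum is re-declared here for illustration, and `validate_search_api` is an assumed helper, not part of the repo):

```python
import enum

class SearchEngine(enum.Enum):
    TAVILY = "tavily"
    INFOQUEST = "infoquest"
    DUCKDUCKGO = "duckduckgo"
    BRAVE_SEARCH = "brave_search"
    ARXIV = "arxiv"

def validate_search_api(value: str) -> SearchEngine:
    """Map a SEARCH_API string to its enum member, or raise listing the valid options."""
    try:
        # Enum value lookup: SearchEngine("infoquest") -> SearchEngine.INFOQUEST
        return SearchEngine(value)
    except ValueError:
        valid = ", ".join(e.value for e in SearchEngine)
        raise ValueError(f"Unsupported SEARCH_API {value!r}; expected one of: {valid}") from None

print(validate_search_api("infoquest"))  # SearchEngine.INFOQUEST
```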
@@ -4,9 +4,12 @@
|
||||
import re
|
||||
import logging
|
||||
|
||||
from .article import Article
|
||||
from .jina_client import JinaClient
|
||||
from .readability_extractor import ReadabilityExtractor
|
||||
from src.config.tools import CrawlerEngine
|
||||
from src.config import load_yaml_config
|
||||
from src.crawler.article import Article
|
||||
from src.crawler.infoquest_client import InfoQuestClient
|
||||
from src.crawler.jina_client import JinaClient
|
||||
from src.crawler.readability_extractor import ReadabilityExtractor
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -138,17 +141,21 @@ class Crawler:
        # them into text and image blocks for one single and unified
        # LLM message.
        #
        # Jina is not the best crawler on readability, however it's
        # much easier and free to use.
        # The system supports multiple crawler engines:
        # - Jina: An accessible solution, though with some limitations in readability extraction
        # - InfoQuest: A BytePlus product offering advanced capabilities with configurable parameters
        #   like fetch_time, timeout, and navi_timeout.
        #
        # Instead of using Jina's own markdown converter, we'll use
        # our own solution to get better readability results.
        try:
            jina_client = JinaClient()
            html = jina_client.crawl(url, return_format="html")
        except Exception as e:
            logger.error(f"Failed to fetch URL {url} from Jina: {repr(e)}")
            raise

        # Get crawler configuration
        config = load_yaml_config("conf.yaml")
        crawler_config = config.get("CRAWLER_ENGINE", {})

        # Get the selected crawler tool based on configuration
        crawler_client = self._select_crawler_tool(crawler_config)
        html = self._crawl_with_tool(crawler_client, url)

        # Check if we got valid HTML content
        if not html or not html.strip():
@@ -186,3 +193,44 @@ class Crawler:
        article.url = url
        return article

    def _select_crawler_tool(self, crawler_config: dict):
        # Only check engine from the configuration file
        engine = crawler_config.get("engine", CrawlerEngine.JINA.value)

        if engine == CrawlerEngine.JINA.value:
            logger.info("Selecting Jina crawler engine")
            return JinaClient()
        elif engine == CrawlerEngine.INFOQUEST.value:
            logger.info("Selecting InfoQuest crawler engine")
            # Read timeout parameters directly from the crawler_config root level.
            # These parameters are only effective when engine is set to "infoquest".
            fetch_time = crawler_config.get("fetch_time", -1)
            timeout = crawler_config.get("timeout", -1)
            navi_timeout = crawler_config.get("navi_timeout", -1)

            # Log the configuration being used
            if fetch_time > 0 or timeout > 0 or navi_timeout > 0:
                logger.debug(
                    f"Initializing InfoQuestCrawler with parameters: "
                    f"fetch_time={fetch_time}, "
                    f"timeout={timeout}, "
                    f"navi_timeout={navi_timeout}"
                )

            # Initialize InfoQuestClient with the parameters from configuration
            return InfoQuestClient(
                fetch_time=fetch_time,
                timeout=timeout,
                navi_timeout=navi_timeout,
            )
        else:
            raise ValueError(f"Unsupported crawler engine: {engine}")

    def _crawl_with_tool(self, crawler_client, url: str) -> str:
        logger.info(f"Crawling URL: {url} using {crawler_client.__class__.__name__}")
        try:
            return crawler_client.crawl(url, return_format="html")
        except Exception as e:
            logger.error(f"Failed to fetch URL {url} using {crawler_client.__class__.__name__}: {repr(e)}")
            raise
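The `_select_crawler_tool` logic above reduces to a small dispatch on the `engine` key: default to Jina, pass the three timeout options through to InfoQuest, and reject anything else. A minimal standalone sketch (the client classes here are stand-ins for the real ones):

```python
class JinaClient:  # stand-in for src.crawler.jina_client.JinaClient
    pass


class InfoQuestClient:  # stand-in; the real client takes the same timeout kwargs
    def __init__(self, fetch_time=-1, timeout=-1, navi_timeout=-1):
        self.fetch_time = fetch_time
        self.timeout = timeout
        self.navi_timeout = navi_timeout


def select_crawler(crawler_config: dict):
    """Mirror of the engine dispatch: Jina by default, InfoQuest with
    root-level timeout keys (-1 means disabled), ValueError otherwise."""
    engine = crawler_config.get("engine", "jina")
    if engine == "jina":
        return JinaClient()
    if engine == "infoquest":
        return InfoQuestClient(
            fetch_time=crawler_config.get("fetch_time", -1),
            timeout=crawler_config.get("timeout", -1),
            navi_timeout=crawler_config.get("navi_timeout", -1),
        )
    raise ValueError(f"Unsupported crawler engine: {engine}")
```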
153
src/crawler/infoquest_client.py
Normal file
@@ -0,0 +1,153 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT

"""Util that calls InfoQuest Crawler API.

In order to set this up, follow instructions at:
https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest
"""

import json
import logging
import os
from typing import Dict, Any

import requests

logger = logging.getLogger(__name__)


class InfoQuestClient:
    """Client for interacting with the InfoQuest web crawling API."""

    def __init__(self, fetch_time: int = -1, timeout: int = -1, navi_timeout: int = -1):
        logger.info(
            "\n============================================\n"
            "🚀 BytePlus InfoQuest Crawler Initialization 🚀\n"
            "============================================"
        )

        self.fetch_time = fetch_time
        self.timeout = timeout
        self.navi_timeout = navi_timeout
        self.api_key_set = bool(os.getenv("INFOQUEST_API_KEY"))

        config_details = (
            f"\n📋 Configuration Details:\n"
            f"├── Fetch Timeout: {fetch_time} {'(Default: No timeout)' if fetch_time == -1 else '(Custom)'}\n"
            f"├── Timeout: {timeout} {'(Default: No timeout)' if timeout == -1 else '(Custom)'}\n"
            f"├── Navigation Timeout: {navi_timeout} {'(Default: No timeout)' if navi_timeout == -1 else '(Custom)'}\n"
            f"└── API Key: {'✅ Configured' if self.api_key_set else '❌ Not set'}"
        )

        logger.info(config_details)
        logger.info("\n" + "*" * 70 + "\n")

    def crawl(self, url: str, return_format: str = "html") -> str:
        logger.debug("Preparing request for URL: %s", url)

        # Prepare headers
        headers = self._prepare_headers()

        # Prepare request data
        data = self._prepare_request_data(url, return_format)

        # Log request details
        logger.debug(
            "InfoQuest Crawler request prepared: endpoint=https://reader.infoquest.bytepluses.com, "
            "format=%s, has_api_key=%s",
            data.get("format"), self.api_key_set
        )

        logger.debug("Sending crawl request to InfoQuest API")
        try:
            response = requests.post(
                "https://reader.infoquest.bytepluses.com",
                headers=headers,
                json=data
            )

            # Check if the status code is not 200
            if response.status_code != 200:
                error_message = f"InfoQuest API returned status {response.status_code}: {response.text}"
                logger.error(error_message)
                return f"Error: {error_message}"

            # Check for an empty response
            if not response.text or not response.text.strip():
                error_message = "InfoQuest Crawler API returned empty response"
                logger.error("BytePlus InfoQuest Crawler returned empty response for URL: %s", url)
                return f"Error: {error_message}"

            # Try to parse the response as JSON and extract reader_result
            try:
                response_data = json.loads(response.text)
                # Extract reader_result if it exists
                if "reader_result" in response_data:
                    logger.debug("Successfully extracted reader_result from JSON response")
                    return response_data["reader_result"]
                elif "content" in response_data:
                    # Fall back to the content field if reader_result is not available
                    logger.debug("Using content field as fallback")
                    return response_data["content"]
                else:
                    # If neither field exists, return the original response
                    logger.warning("Neither reader_result nor content field found in JSON response")
            except json.JSONDecodeError:
                # If the response is not JSON, return the original text
                logger.debug("Response is not in JSON format, returning as-is")

            # Log a partial response for debugging
            if logger.isEnabledFor(logging.DEBUG):
                response_sample = response.text[:200] + ("..." if len(response.text) > 200 else "")
                logger.debug(
                    "Successfully received response, content length: %d bytes, first 200 chars: %s",
                    len(response.text), response_sample
                )
            return response.text
        except Exception as e:
            error_message = f"Request to InfoQuest API failed: {str(e)}"
            logger.error(error_message)
            return f"Error: {error_message}"

    def _prepare_headers(self) -> Dict[str, str]:
        """Prepare request headers."""
        headers = {
            "Content-Type": "application/json",
        }

        # Add the API key if available
        if os.getenv("INFOQUEST_API_KEY"):
            headers["Authorization"] = f"Bearer {os.getenv('INFOQUEST_API_KEY')}"
            logger.debug("API key added to request headers")
        else:
            logger.warning(
                "InfoQuest API key is not set. Provide your own key for authentication."
            )

        return headers

    def _prepare_request_data(self, url: str, return_format: str) -> Dict[str, Any]:
        """Prepare request data with formatted parameters."""
        # Normalize return_format
        if return_format and return_format.lower() == "html":
            normalized_format = "HTML"
        else:
            normalized_format = return_format

        data = {"url": url, "format": normalized_format}

        # Add timeout parameters if set to positive values
        timeout_params = {}
        if self.fetch_time > 0:
            timeout_params["fetch_time"] = self.fetch_time
        if self.timeout > 0:
            timeout_params["timeout"] = self.timeout
        if self.navi_timeout > 0:
            timeout_params["navi_timeout"] = self.navi_timeout

        # Log the applied timeout parameters
        if timeout_params:
            logger.debug("Applying timeout parameters: %s", timeout_params)
            data.update(timeout_params)

        return data
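The response handling in `InfoQuestClient.crawl` boils down to a three-way fallback: prefer `reader_result`, then `content`, otherwise pass the raw body through (including non-JSON responses). The same logic as a standalone sketch:

```python
import json


def extract_body(text: str) -> str:
    """Mirror of the InfoQuestClient fallback: prefer reader_result,
    then content, else return the raw response text unchanged."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return text  # non-JSON responses are passed through as-is
    if "reader_result" in data:
        return data["reader_result"]
    if "content" in data:
        return data["content"]
    return text  # JSON without either field: keep the original body
```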
@@ -22,12 +22,21 @@ class JinaClient:
                "Jina API key is not set. Provide your own key to access a higher rate limit. See https://jina.ai/reader for more information."
            )
        data = {"url": url}
        try:
            response = requests.post("https://r.jina.ai/", headers=headers, json=data)

            if response.status_code != 200:
                raise ValueError(f"Jina API returned status {response.status_code}: {response.text}")
                error_message = f"Jina API returned status {response.status_code}: {response.text}"
                logger.error(error_message)
                return f"Error: {error_message}"

            if not response.text or not response.text.strip():
                raise ValueError("Jina API returned empty response")
                error_message = "Jina API returned empty response"
                logger.error(error_message)
                return f"Error: {error_message}"

            return response.text
        except Exception as e:
            error_message = f"Request to Jina API failed: {str(e)}"
            logger.error(error_message)
            return f"Error: {error_message}"
4
src/tools/infoquest_search/__init__.py
Normal file
@@ -0,0 +1,4 @@
from .infoquest_search_api import InfoQuestAPIWrapper
from .infoquest_search_results import InfoQuestSearchResults

__all__ = ["InfoQuestAPIWrapper", "InfoQuestSearchResults"]
232
src/tools/infoquest_search/infoquest_search_api.py
Normal file
@@ -0,0 +1,232 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT

"""Util that calls InfoQuest Search API.

In order to set this up, follow instructions at:
https://docs.byteplus.com/en/docs/InfoQuest/What_is_Info_Quest
"""

import json
import logging
from typing import Any, Dict, List

import aiohttp
import requests
from langchain_core.utils import get_from_dict_or_env
from pydantic import BaseModel, ConfigDict, SecretStr, model_validator

from src.config import load_yaml_config

logger = logging.getLogger(__name__)

INFOQUEST_API_URL = "https://search.infoquest.bytepluses.com"


def get_search_config():
    config = load_yaml_config("conf.yaml")
    search_config = config.get("SEARCH_ENGINE", {})
    return search_config


class InfoQuestAPIWrapper(BaseModel):
    """Wrapper for InfoQuest Search API."""

    infoquest_api_key: SecretStr
    model_config = ConfigDict(
        extra="forbid",
    )

    @model_validator(mode="before")
    @classmethod
    def validate_environment(cls, values: Dict) -> Any:
        """Validate that the API key exists in the environment."""
        logger.info("Initializing BytePlus InfoQuest Product - Search API client")

        infoquest_api_key = get_from_dict_or_env(
            values, "infoquest_api_key", "INFOQUEST_API_KEY"
        )
        values["infoquest_api_key"] = infoquest_api_key

        logger.info("BytePlus InfoQuest Product - Environment validation successful")
        return values

    def raw_results(
        self,
        query: str,
        time_range: int,
        site: str,
        output_format: str = "JSON",
    ) -> Dict:
        """Get results from the InfoQuest Search API synchronously."""
        if logger.isEnabledFor(logging.DEBUG):
            query_truncated = query[:50] + "..." if len(query) > 50 else query
            logger.debug(
                f"InfoQuest - Search API request initiated | "
                f"operation=search | "
                f"query_truncated={query_truncated} | "
                f"has_time_filter={time_range > 0} | "
                f"has_site_filter={bool(site)} | "
                f"request_type=sync"
            )

        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.infoquest_api_key.get_secret_value()}",
        }

        params = {
            "format": output_format,
            "query": query
        }
        if time_range > 0:
            params["time_range"] = time_range
            logger.debug(f"InfoQuest - Applying time range filter: time_range_days={time_range}")

        if site != "":
            params["site"] = site
            logger.debug(f"InfoQuest - Applying site filter: site={site}")

        response = requests.post(
            INFOQUEST_API_URL,
            headers=headers,
            json=params
        )
        response.raise_for_status()

        # Log a partial response for debugging
        response_json = response.json()
        if logger.isEnabledFor(logging.DEBUG):
            response_sample = json.dumps(response_json)[:200] + ("..." if len(json.dumps(response_json)) > 200 else "")
            logger.debug(
                f"Search API request completed successfully | "
                f"service=InfoQuest | "
                f"status=success | "
                f"response_sample={response_sample}"
            )

        return response_json["search_result"]

    async def raw_results_async(
        self,
        query: str,
        time_range: int,
        site: str,
        output_format: str = "JSON",
    ) -> Dict:
        """Get results from the InfoQuest Search API asynchronously."""
        if logger.isEnabledFor(logging.DEBUG):
            query_truncated = query[:50] + "..." if len(query) > 50 else query
            logger.debug(
                f"BytePlus InfoQuest - Search API async request initiated | "
                f"operation=search | "
                f"query_truncated={query_truncated} | "
                f"has_time_filter={time_range > 0} | "
                f"has_site_filter={bool(site)} | "
                f"request_type=async"
            )

        # Inner coroutine that performs the API call
        async def fetch() -> str:
            headers = {
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.infoquest_api_key.get_secret_value()}",
            }
            params = {
                "format": output_format,
                "query": query,
            }
            if time_range > 0:
                params["time_range"] = time_range
                logger.debug(f"Applying time range filter in async request: {time_range} days")
            if site != "":
                params["site"] = site
                logger.debug(f"Applying site filter in async request: {site}")

            async with aiohttp.ClientSession(trust_env=True) as session:
                async with session.post(INFOQUEST_API_URL, headers=headers, json=params) as res:
                    if res.status == 200:
                        data = await res.text()
                        return data
                    else:
                        raise Exception(f"Error {res.status}: {res.reason}")

        results_json_str = await fetch()

        # Log a partial response for debugging
        if logger.isEnabledFor(logging.DEBUG):
            response_sample = results_json_str[:200] + ("..." if len(results_json_str) > 200 else "")
            logger.debug(
                f"Async search API request completed successfully | "
                f"service=InfoQuest | "
                f"status=success | "
                f"response_sample={response_sample}"
            )
        return json.loads(results_json_str)["search_result"]

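Both the sync and async paths above build the request body the same way: `format` and `query` always, `time_range` and `site` only when set. The shared shape, sketched as a small helper:

```python
def build_search_params(query, time_range=-1, site="", output_format="JSON"):
    """Illustrative helper mirroring how both raw_results variants
    assemble the request body; -1 and "" mean 'no filter'."""
    params = {"format": output_format, "query": query}
    if time_range > 0:       # days; -1 disables the time filter
        params["time_range"] = time_range
    if site != "":           # e.g. "nytimes.com"
        params["site"] = site
    return params
```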
    def clean_results_with_images(
        self, raw_results: List[Dict[str, Dict[str, Dict[str, Any]]]]
    ) -> List[Dict]:
        """Clean results from the InfoQuest Search API."""
        logger.debug("Processing search results")

        seen_urls = set()
        clean_results = []
        counts = {"pages": 0, "news": 0, "images": 0}

        for content_list in raw_results:
            content = content_list["content"]
            results = content["results"]

            if results.get("organic"):
                organic_results = results["organic"]
                for result in organic_results:
                    clean_result = {
                        "type": "page",
                        "title": result["title"],
                        "url": result["url"],
                        "desc": result["desc"],
                    }
                    url = clean_result["url"]
                    if isinstance(url, str) and url and url not in seen_urls:
                        seen_urls.add(url)
                        clean_results.append(clean_result)
                        counts["pages"] += 1

            if results.get("top_stories"):
                news = results["top_stories"]
                for obj in news["items"]:
                    clean_result = {
                        "type": "news",
                        "time_frame": obj["time_frame"],
                        "title": obj["title"],
                        "url": obj["url"],
                        "source": obj["source"],
                    }
                    url = clean_result["url"]
                    if isinstance(url, str) and url and url not in seen_urls:
                        seen_urls.add(url)
                        clean_results.append(clean_result)
                        counts["news"] += 1

            if results.get("images"):
                images = results["images"]
                for image in images["items"]:
                    clean_result = {
                        "type": "image_url",
                        "image_url": image["url"],
                        "image_description": image["alt"],
                    }
                    url = clean_result["image_url"]
                    if isinstance(url, str) and url and url not in seen_urls:
                        seen_urls.add(url)
                        clean_results.append(clean_result)
                        counts["images"] += 1

        logger.debug(
            f"Results processing completed | "
            f"total_results={len(clean_results)} | "
            f"pages={counts['pages']} | "
            f"news_items={counts['news']} | "
            f"images={counts['images']} | "
            f"unique_urls={len(seen_urls)}"
        )

        return clean_results
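The deduplication in `clean_results_with_images` keys on URL across all three result types, keeping only the first occurrence and dropping entries without a non-empty string URL. The core pattern, sketched independently:

```python
def dedupe_by_url(items):
    """Keep the first item per URL; items without a non-empty string
    URL are dropped, matching the isinstance/url checks above."""
    seen, out = set(), []
    for item in items:
        # Pages and news carry "url"; image results carry "image_url".
        url = item.get("url") or item.get("image_url")
        if isinstance(url, str) and url and url not in seen:
            seen.add(url)
            out.append(item)
    return out
```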
236
src/tools/infoquest_search/infoquest_search_results.py
Normal file
@@ -0,0 +1,236 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT

"""Tool for the InfoQuest search API."""

import json
import logging
from typing import Any, Dict, List, Literal, Optional, Tuple, Type, Union

from langchain_core.callbacks import (
    AsyncCallbackManagerForToolRun,
    CallbackManagerForToolRun,
)
from langchain_core.tools import BaseTool
from pydantic import BaseModel, Field

from src.tools.infoquest_search.infoquest_search_api import InfoQuestAPIWrapper

logger = logging.getLogger(__name__)

class InfoQuestInput(BaseModel):
    """Input for the InfoQuest tool."""

    query: str = Field(description="search query to look up")


class InfoQuestSearchResults(BaseTool):
    """Tool that queries the InfoQuest Search API and returns processed results with images.

    Setup:
        Install required packages and set environment variable ``INFOQUEST_API_KEY``.

        .. code-block:: bash

            pip install -U langchain-community aiohttp
            export INFOQUEST_API_KEY="your-api-key"

    Instantiate:
        .. code-block:: python

            from your_module import InfoQuestSearchResults

            tool = InfoQuestSearchResults(
                output_format="json",
                time_range=10,
                site="nytimes.com"
            )

    Invoke directly with args:
        .. code-block:: python

            tool.invoke({
                'query': 'who won the last french open'
            })

        .. code-block:: json

            [
                {
                    "type": "page",
                    "title": "Djokovic Claims French Open Title...",
                    "url": "https://www.nytimes.com/...",
                    "desc": "Novak Djokovic won the 2024 French Open by defeating Casper Ruud..."
                },
                {
                    "type": "news",
                    "time_frame": "2 days ago",
                    "title": "French Open Finals Recap",
                    "url": "https://www.nytimes.com/...",
                    "source": "New York Times"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://www.nytimes.com/.../djokovic.jpg"},
                    "image_description": "Novak Djokovic celebrating his French Open victory"
                }
            ]

    Invoke with tool call:
        .. code-block:: python

            tool.invoke({
                "args": {
                    'query': 'who won the last french open',
                },
                "type": "tool_call",
                "id": "foo",
                "name": "infoquest"
            })

        .. code-block:: python

            ToolMessage(
                content='[
                    {"type": "page", "title": "Djokovic Claims...", "url": "https://www.nytimes.com/...", "desc": "Novak Djokovic won..."},
                    {"type": "news", "time_frame": "2 days ago", "title": "French Open Finals...", "url": "https://www.nytimes.com/...", "source": "New York Times"},
                    {"type": "image_url", "image_url": {"url": "https://www.nytimes.com/.../djokovic.jpg"}, "image_description": "Novak Djokovic celebrating..."}
                ]',
                tool_call_id='1',
                name='infoquest_search_results_json',
            )

    """  # noqa: E501

    name: str = "infoquest_search_results_json"
    description: str = (
        "A search engine optimized for comprehensive, accurate, and trusted results. "
        "Useful for when you need to answer questions about current events. "
        "Input should be a search query."
    )
    args_schema: Type[BaseModel] = InfoQuestInput
    """The input schema for the tool."""

    time_range: int = -1
    """Time range for filtering search results, in days.

    If set to a positive integer (e.g., 30), only results from the last N days will be included.
    Default is -1, which means no time range filter is applied.
    """

    site: str = ""
    """Specific domain to restrict search results to (e.g., "nytimes.com").

    If provided, only results from the specified domain will be returned.
    Default is an empty string, which means no domain restriction is applied.
    """

    api_wrapper: InfoQuestAPIWrapper = Field(default_factory=InfoQuestAPIWrapper)  # type: ignore[arg-type]
    response_format: Literal["content_and_artifact"] = "content_and_artifact"
    """The tool response format."""

    def __init__(self, **kwargs: Any) -> None:
        # Create api_wrapper with infoquest_api_key if provided
        if "infoquest_api_key" in kwargs:
            kwargs["api_wrapper"] = InfoQuestAPIWrapper(
                infoquest_api_key=kwargs["infoquest_api_key"]
            )
            logger.debug("API wrapper initialized with provided key")

        super().__init__(**kwargs)

        logger.info(
            "\n============================================\n"
            "🚀 BytePlus InfoQuest Search Initialization 🚀\n"
            "============================================"
        )

        # Prepare initialization details
        time_range_status = f"{self.time_range} days" if hasattr(self, 'time_range') and self.time_range > 0 else "Disabled"
        site_filter = f"'{self.site}'" if hasattr(self, 'site') and self.site else "Disabled"

        initialization_details = (
            f"\n🔧 Tool Information:\n"
            f"├── Tool Name: {self.name}\n"
            f"├── Time Range Filter: {time_range_status}\n"
            f"└── Site Filter: {site_filter}\n"
            f"📊 Configuration Summary:\n"
            f"├── Response Format: {self.response_format}\n"
        )

        logger.info(initialization_details)
        logger.info("\n" + "*" * 70 + "\n")

    def _run(
        self,
        query: str,
        run_manager: Optional[CallbackManagerForToolRun] = None,
    ) -> Tuple[Union[List[Dict[str, str]], str], Dict]:
        """Use the tool."""
        try:
            logger.debug(f"Executing search with parameters: time_range={self.time_range}, site={self.site}")
            raw_results = self.api_wrapper.raw_results(
                query,
                self.time_range,
                self.site
            )
            logger.debug("Processing raw search results")
            cleaned_results = self.api_wrapper.clean_results_with_images(raw_results["results"])

            result_json = json.dumps(cleaned_results, ensure_ascii=False)

            logger.info(
                f"Search tool execution completed | "
                f"mode=synchronous | "
                f"results_count={len(cleaned_results)}"
            )
            return result_json, raw_results
        except Exception as e:
            logger.error(
                f"Search tool execution failed | "
                f"mode=synchronous | "
                f"error={str(e)}"
            )
            error_result = json.dumps({"error": repr(e)}, ensure_ascii=False)
            return error_result, {}

    async def _arun(
        self,
        query: str,
        run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
    ) -> Tuple[Union[List[Dict[str, str]], str], Dict]:
        """Use the tool asynchronously."""
        if logger.isEnabledFor(logging.DEBUG):
            query_truncated = query[:50] + "..." if len(query) > 50 else query
            logger.debug(
                f"Search tool execution started | "
                f"mode=asynchronous | "
                f"query={query_truncated}"
            )
        try:
            logger.debug(f"Executing async search with parameters: time_range={self.time_range}, site={self.site}")

            raw_results = await self.api_wrapper.raw_results_async(
                query,
                self.time_range,
                self.site
            )

            logger.debug("Processing raw async search results")
            cleaned_results = self.api_wrapper.clean_results_with_images(raw_results["results"])

            result_json = json.dumps(cleaned_results, ensure_ascii=False)

            logger.debug(
                f"Search tool execution completed | "
                f"mode=asynchronous | "
                f"results_count={len(cleaned_results)}"
            )

            return result_json, raw_results
        except Exception as e:
            logger.error(
                f"Search tool execution failed | "
                f"mode=asynchronous | "
                f"error={str(e)}"
            )
            error_result = json.dumps({"error": repr(e)}, ensure_ascii=False)
            return error_result, {}
@@ -21,6 +21,7 @@ from langchain_community.utilities import (

from src.config import SELECTED_SEARCH_ENGINE, SearchEngine, load_yaml_config
from src.tools.decorators import create_logged_tool
from src.tools.infoquest_search.infoquest_search_results import InfoQuestSearchResults
from src.tools.tavily_search.tavily_search_results_with_images import (
    TavilySearchWithImages,
)
@@ -29,6 +30,7 @@ logger = logging.getLogger(__name__)

# Create logged versions of the search tools
LoggedTavilySearch = create_logged_tool(TavilySearchWithImages)
LoggedInfoQuestSearch = create_logged_tool(InfoQuestSearchResults)
LoggedDuckDuckGoSearch = create_logged_tool(DuckDuckGoSearchResults)
LoggedBraveSearch = create_logged_tool(BraveSearch)
LoggedArxivSearch = create_logged_tool(ArxivQueryRun)
@@ -76,6 +78,17 @@ def get_web_search_tool(max_search_results: int):
            include_domains=include_domains,
            exclude_domains=exclude_domains,
        )
    elif SELECTED_SEARCH_ENGINE == SearchEngine.INFOQUEST.value:
        time_range = search_config.get("time_range", -1)
        site = search_config.get("site", "")
        logger.info(
            f"InfoQuest search configuration loaded: time_range={time_range}, site={site}"
        )
        return LoggedInfoQuestSearch(
            name="web_search",
            time_range=time_range,
            site=site,
        )
    elif SELECTED_SEARCH_ENGINE == SearchEngine.DUCKDUCKGO.value:
        return LoggedDuckDuckGoSearch(
            name="web_search",

@@ -3,6 +3,7 @@

import src.crawler as crawler_module
from src.crawler.crawler import safe_truncate
from src.crawler.infoquest_client import InfoQuestClient


def test_crawler_sets_article_url(monkeypatch):
@@ -19,16 +20,28 @@ def test_crawler_sets_article_url(monkeypatch):
        def crawl(self, url, return_format=None):
            return "<html>dummy</html>"

    class DummyInfoQuestClient:
        def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
            pass

        def crawl(self, url, return_format=None):
            return "<html>dummy</html>"

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            return DummyArticle()

    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "jina"}}

    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.Crawler()
    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)
    assert article.url == url
@@ -44,6 +57,16 @@ def test_crawler_calls_dependencies(monkeypatch):
            calls["jina"] = (url, return_format)
            return "<html>dummy</html>"

    # Fix: Update DummyInfoQuestClient to accept initialization parameters
    class DummyInfoQuestClient:
        def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
            # We don't need to use these parameters, just accept them
            pass

        def crawl(self, url, return_format=None):
            calls["infoquest"] = (url, return_format)
            return "<html>dummy</html>"

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            calls["extractor"] = html
@@ -56,12 +79,16 @@ def test_crawler_calls_dependencies(monkeypatch):

            return DummyArticle()

    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    # Add a mock for load_yaml_config so it returns a configuration with the Jina engine
    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "jina"}}

    crawler = crawler_module.Crawler()
    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)  # Include this if InfoQuest might be used
    monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    crawler.crawl(url)
    assert "jina" in calls
@@ -92,16 +119,61 @@ def test_crawler_handles_empty_content(monkeypatch):
            # This should not be called for empty content
            assert False, "ReadabilityExtractor should not be called for empty content"

    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "jina"}}

    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)

    assert article.url == url
    assert article.title == "Empty Content"
    assert "No content could be extracted from this page" in article.html_content

def test_crawler_handles_error_response_from_client(monkeypatch):
    """Test that the crawler handles error responses from the client gracefully."""

    class DummyArticle:
        def __init__(self, title, html_content):
            self.title = title
            self.html_content = html_content
            self.url = None

        def to_markdown(self):
            return f"# {self.title}"

    class DummyJinaClient:
        def crawl(self, url, return_format=None):
            return "Error: API returned status 500"

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            # This should not be called for error responses
            assert False, "ReadabilityExtractor should not be called for error responses"

    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "jina"}}

    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)

    assert article.url == url
    assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
    assert "Error: API returned status 500" in article.html_content


def test_crawler_handles_non_html_content(monkeypatch):
@@ -125,16 +197,22 @@ def test_crawler_handles_non_html_content(monkeypatch):
            # This should not be called for non-HTML content
            assert False, "ReadabilityExtractor should not be called for non-HTML content"

    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "jina"}}

    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)
    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)

    assert article.url == url
    assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
    assert "cannot be parsed as HTML" in article.html_content or "Content extraction failed" in article.html_content
    assert "plain text content" in article.html_content  # Should include a snippet of the original content

@@ -158,10 +236,16 @@ def test_crawler_handles_extraction_failure(monkeypatch):
        def extract_article(self, html):
            raise Exception("Extraction failed")

    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "jina"}}

    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)

@@ -192,16 +276,22 @@ def test_crawler_with_json_like_content(monkeypatch):
            # This should not be called for JSON content
            assert False, "ReadabilityExtractor should not be called for JSON content"

    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "jina"}}

    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com/api/data"
    article = crawler.crawl(url)

    assert article.url == url
    assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
    assert "cannot be parsed as HTML" in article.html_content or "Content extraction failed" in article.html_content
    assert '{"title": "Some JSON"' in article.html_content  # Should include a snippet of the JSON

@@ -241,6 +331,9 @@ def test_crawler_with_various_html_formats(monkeypatch):
        def extract_article(self, html):
            return DummyArticle("Extracted Article", "<p>Extracted content</p>")

    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "jina"}}

    # Test each HTML format
    test_cases = [
        (DummyJinaClient1, "HTML with DOCTYPE"),
@@ -252,8 +345,9 @@ def test_crawler_with_various_html_formats(monkeypatch):
    for JinaClientClass, description in test_cases:
        monkeypatch.setattr("src.crawler.crawler.JinaClient", JinaClientClass)
        monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
        monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

        crawler = crawler_module.crawler.Crawler()
        url = "http://example.com"
        article = crawler.crawl(url)

@@ -298,3 +392,284 @@ def test_safe_truncate_function():
    assert len(result) <= 10
    # Verify it's valid UTF-8
    assert result.encode('utf-8').decode('utf-8') == result

# ========== InfoQuest Client Tests ==========


def test_crawler_selects_infoquest_engine(monkeypatch):
    """Test that the crawler selects InfoQuestClient when configured to use it."""
    calls = {}

    class DummyJinaClient:
        def crawl(self, url, return_format=None):
            calls["jina"] = True
            return "<html>dummy</html>"

    class DummyInfoQuestClient:
        def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
            calls["infoquest_init"] = (fetch_time, timeout, navi_timeout)

        def crawl(self, url, return_format=None):
            calls["infoquest"] = (url, return_format)
            return "<html>dummy from infoquest</html>"

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            calls["extractor"] = html

            class DummyArticle:
                url = None

                def to_markdown(self):
                    return "# Dummy"

            return DummyArticle()

    # Mock configuration to use InfoQuest engine with custom parameters
    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {
            "engine": "infoquest",
            "fetch_time": 30,
            "timeout": 60,
            "navi_timeout": 45,
        }}

    monkeypatch.setattr("src.crawler.crawler.JinaClient", DummyJinaClient)
    monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
    monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    crawler.crawl(url)

    # Verify InfoQuestClient was used, not JinaClient
    assert "infoquest_init" in calls
    assert calls["infoquest_init"] == (30, 60, 45)  # Verify parameters were passed correctly
    assert "infoquest" in calls
    assert calls["infoquest"][0] == url
    assert calls["infoquest"][1] == "html"
    assert "extractor" in calls
    assert calls["extractor"] == "<html>dummy from infoquest</html>"
    assert "jina" not in calls

def test_crawler_with_infoquest_empty_content(monkeypatch):
    """Test that the crawler handles empty content from the InfoQuest client gracefully."""

    class DummyArticle:
        def __init__(self, title, html_content):
            self.title = title
            self.html_content = html_content
            self.url = None

        def to_markdown(self):
            return f"# {self.title}"

    class DummyInfoQuestClient:
        def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
            pass

        def crawl(self, url, return_format=None):
            return ""  # Empty content

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            # This should not be called for empty content
            assert False, "ReadabilityExtractor should not be called for empty content"

    # Mock configuration to use InfoQuest engine
    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "infoquest"}}

    monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)

    assert article.url == url
    assert article.title == "Empty Content"
    assert "No content could be extracted from this page" in article.html_content

def test_crawler_with_infoquest_non_html_content(monkeypatch):
    """Test that the crawler handles non-HTML content from the InfoQuest client gracefully."""

    class DummyArticle:
        def __init__(self, title, html_content):
            self.title = title
            self.html_content = html_content
            self.url = None

        def to_markdown(self):
            return f"# {self.title}"

    class DummyInfoQuestClient:
        def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
            pass

        def crawl(self, url, return_format=None):
            return "This is plain text content from InfoQuest, not HTML"

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            # This should not be called for non-HTML content
            assert False, "ReadabilityExtractor should not be called for non-HTML content"

    # Mock configuration to use InfoQuest engine
    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "infoquest"}}

    monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)

    assert article.url == url
    assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
    assert "cannot be parsed as HTML" in article.html_content or "Content extraction failed" in article.html_content
    assert "plain text content from InfoQuest" in article.html_content

def test_crawler_with_infoquest_error_response(monkeypatch):
    """Test that the crawler handles error responses from the InfoQuest client gracefully."""

    class DummyArticle:
        def __init__(self, title, html_content):
            self.title = title
            self.html_content = html_content
            self.url = None

        def to_markdown(self):
            return f"# {self.title}"

    class DummyInfoQuestClient:
        def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
            pass

        def crawl(self, url, return_format=None):
            return "Error: InfoQuest API returned status 403: Forbidden"

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            # This should not be called for error responses
            assert False, "ReadabilityExtractor should not be called for error responses"

    # Mock configuration to use InfoQuest engine
    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "infoquest"}}

    monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)

    assert article.url == url
    assert article.title in ["Non-HTML Content", "Content Extraction Failed"]
    assert "Error: InfoQuest API returned status 403: Forbidden" in article.html_content

def test_crawler_with_infoquest_json_response(monkeypatch):
    """Test that the crawler handles JSON responses from the InfoQuest client correctly."""

    class DummyArticle:
        def __init__(self, title, html_content):
            self.title = title
            self.html_content = html_content
            self.url = None

        def to_markdown(self):
            return f"# {self.title}"

    class DummyInfoQuestClient:
        def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
            pass

        def crawl(self, url, return_format=None):
            return "<html><body>Content from InfoQuest JSON</body></html>"

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            return DummyArticle("Extracted from JSON", html)

    # Mock configuration to use InfoQuest engine
    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "infoquest"}}

    monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
    monkeypatch.setattr(
        "src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor
    )
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    url = "http://example.com"
    article = crawler.crawl(url)

    assert article.url == url
    assert article.title == "Extracted from JSON"
    assert "Content from InfoQuest JSON" in article.html_content

def test_infoquest_client_initialization_params():
    """Test that InfoQuestClient correctly initializes with the provided parameters."""
    # Test default parameters
    client_default = InfoQuestClient()
    assert client_default.fetch_time == -1
    assert client_default.timeout == -1
    assert client_default.navi_timeout == -1

    # Test custom parameters
    client_custom = InfoQuestClient(fetch_time=30, timeout=60, navi_timeout=45)
    assert client_custom.fetch_time == 30
    assert client_custom.timeout == 60
    assert client_custom.navi_timeout == 45

def test_crawler_with_infoquest_default_parameters(monkeypatch):
    """Test that the crawler initializes InfoQuestClient with default parameters when none are provided."""
    calls = {}

    class DummyInfoQuestClient:
        def __init__(self, fetch_time=None, timeout=None, navi_timeout=None):
            calls["infoquest_init"] = (fetch_time, timeout, navi_timeout)

        def crawl(self, url, return_format=None):
            return "<html>dummy</html>"

    class DummyReadabilityExtractor:
        def extract_article(self, html):
            class DummyArticle:
                url = None

                def to_markdown(self):
                    return "# Dummy"

            return DummyArticle()

    # Mock configuration to use InfoQuest engine without custom parameters
    def mock_load_config(*args, **kwargs):
        return {"CRAWLER_ENGINE": {"engine": "infoquest"}}

    monkeypatch.setattr("src.crawler.crawler.InfoQuestClient", DummyInfoQuestClient)
    monkeypatch.setattr("src.crawler.crawler.ReadabilityExtractor", DummyReadabilityExtractor)
    monkeypatch.setattr("src.crawler.crawler.load_yaml_config", mock_load_config)

    crawler = crawler_module.crawler.Crawler()
    crawler.crawl("http://example.com")

    # Verify default parameters were passed
    assert "infoquest_init" in calls
    assert calls["infoquest_init"] == (-1, -1, -1)
tests/unit/crawler/test_infoquest_client.py (new file, 230 lines)
@@ -0,0 +1,230 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT

import json
from unittest.mock import Mock, patch

from src.crawler.infoquest_client import InfoQuestClient


class TestInfoQuestClient:
    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_success(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        mock_response.text = "<html><body>Test Content</body></html>"
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result == "<html><body>Test Content</body></html>"
        mock_post.assert_called_once()

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_json_response_with_reader_result(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        json_data = {
            "reader_result": "<p>Extracted content from JSON</p>",
            "err_code": 0,
            "err_msg": "success",
        }
        mock_response.text = json.dumps(json_data)
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result == "<p>Extracted content from JSON</p>"

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_json_response_with_content_fallback(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        json_data = {
            "content": "<p>Content fallback from JSON</p>",
            "err_code": 0,
            "err_msg": "success",
        }
        mock_response.text = json.dumps(json_data)
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result == "<p>Content fallback from JSON</p>"

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_json_response_without_expected_fields(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        json_data = {
            "unexpected_field": "some value",
            "err_code": 0,
            "err_msg": "success",
        }
        mock_response.text = json.dumps(json_data)
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result == json.dumps(json_data)

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_http_error(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 500
        mock_response.text = "Internal Server Error"
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "status 500" in result

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_empty_response(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        mock_response.text = ""
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "empty response" in result

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_whitespace_only_response(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        mock_response.text = " \n \t "
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "empty response" in result

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_not_found(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 404
        mock_response.text = "Not Found"
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "status 404" in result

    @patch.dict("os.environ", {}, clear=True)
    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_without_api_key_logs_warning(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        mock_response.text = "<html>Test</html>"
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result == "<html>Test</html>"

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_with_timeout_parameters(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        mock_response.text = "<html>Test</html>"
        mock_post.return_value = mock_response

        client = InfoQuestClient(fetch_time=10, timeout=20, navi_timeout=30)

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result == "<html>Test</html>"
        # Verify the post call was made with timeout parameters
        call_args = mock_post.call_args[1]
        assert call_args['json']['fetch_time'] == 10
        assert call_args['json']['timeout'] == 20
        assert call_args['json']['navi_timeout'] == 30

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_with_markdown_format(self, mock_post):
        # Arrange
        mock_response = Mock()
        mock_response.status_code = 200
        mock_response.text = "# Markdown Content"
        mock_post.return_value = mock_response

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com", return_format="markdown")

        # Assert
        assert result == "# Markdown Content"
        # Verify the format was set correctly
        call_args = mock_post.call_args[1]
        assert call_args['json']['format'] == "markdown"

    @patch("src.crawler.infoquest_client.requests.post")
    def test_crawl_exception_handling(self, mock_post):
        # Arrange
        mock_post.side_effect = Exception("Network error")

        client = InfoQuestClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "Network error" in result
@@ -36,11 +36,12 @@ class TestJinaClient:
        client = JinaClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "status 500" in result

    @patch("src.crawler.jina_client.requests.post")
    def test_crawl_empty_response(self, mock_post):
@@ -52,11 +53,12 @@ class TestJinaClient:
        client = JinaClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "empty response" in result

    @patch("src.crawler.jina_client.requests.post")
    def test_crawl_whitespace_only_response(self, mock_post):
@@ -68,11 +70,12 @@ class TestJinaClient:
        client = JinaClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "empty response" in result

    @patch("src.crawler.jina_client.requests.post")
    def test_crawl_not_found(self, mock_post):
@@ -84,11 +87,12 @@ class TestJinaClient:
        client = JinaClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "status 404" in result

    @patch.dict("os.environ", {}, clear=True)
    @patch("src.crawler.jina_client.requests.post")
@@ -106,3 +110,17 @@ class TestJinaClient:

        # Assert
        assert result == "<html>Test</html>"

    @patch("src.crawler.jina_client.requests.post")
    def test_crawl_exception_handling(self, mock_post):
        # Arrange
        mock_post.side_effect = Exception("Network error")

        client = JinaClient()

        # Act
        result = client.crawl("https://example.com")

        # Assert
        assert result.startswith("Error:")
        assert "Network error" in result
tests/unit/tools/test_infoquest_search_api.py (new file, 218 lines)
@@ -0,0 +1,218 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT

from unittest.mock import Mock, patch

import pytest
import requests

from src.tools.infoquest_search.infoquest_search_api import InfoQuestAPIWrapper


class TestInfoQuestAPIWrapper:
    @pytest.fixture
    def wrapper(self):
        # Create a wrapper instance with a mock API key
        return InfoQuestAPIWrapper(infoquest_api_key="dummy-key")

    @pytest.fixture
    def mock_response_data(self):
        # Mock search result data
        return {
            "search_result": {
                "results": [
                    {
                        "content": {
                            "results": {
                                "organic": [
                                    {
                                        "title": "Test Title",
                                        "url": "https://example.com",
                                        "desc": "Test description",
                                    }
                                ],
                                "top_stories": {
                                    "items": [
                                        {
                                            "time_frame": "2 days ago",
                                            "title": "Test News",
                                            "url": "https://example.com/news",
                                            "source": "Test Source",
                                        }
                                    ]
                                },
                                "images": {
                                    "items": [
                                        {
                                            "url": "https://example.com/image.jpg",
                                            "alt": "Test image description",
                                        }
                                    ]
                                },
                            }
                        }
                    }
                ]
            }
        }

    @patch("src.tools.infoquest_search.infoquest_search_api.requests.post")
    def test_raw_results_success(self, mock_post, wrapper, mock_response_data):
        # Test successful synchronous search results
        mock_response = Mock()
        mock_response.json.return_value = mock_response_data
        mock_response.raise_for_status.return_value = None
        mock_post.return_value = mock_response

        result = wrapper.raw_results("test query", time_range=0, site="")

        assert result == mock_response_data["search_result"]
        mock_post.assert_called_once()
        call_args = mock_post.call_args
        assert "json" in call_args.kwargs
        assert call_args.kwargs["json"]["query"] == "test query"
        assert "time_range" not in call_args.kwargs["json"]
        assert "site" not in call_args.kwargs["json"]

    @patch("src.tools.infoquest_search.infoquest_search_api.requests.post")
    def test_raw_results_with_time_range_and_site(self, mock_post, wrapper, mock_response_data):
        # Test search with time range and site filtering
        mock_response = Mock()
        mock_response.json.return_value = mock_response_data
        mock_response.raise_for_status.return_value = None
        mock_post.return_value = mock_response

        result = wrapper.raw_results("test query", time_range=30, site="example.com")

        assert result == mock_response_data["search_result"]
        call_args = mock_post.call_args
        params = call_args.kwargs["json"]
        assert params["time_range"] == 30
        assert params["site"] == "example.com"

    @patch("src.tools.infoquest_search.infoquest_search_api.requests.post")
    def test_raw_results_http_error(self, mock_post, wrapper):
        # Test HTTP error handling
        mock_response = Mock()
        mock_response.raise_for_status.side_effect = requests.HTTPError("API Error")
        mock_post.return_value = mock_response

        with pytest.raises(requests.HTTPError):
            wrapper.raw_results("test query", time_range=0, site="")

    # Check if pytest-asyncio is available, otherwise mark for conditional skipping
    try:
        import pytest_asyncio
        _asyncio_available = True
    except ImportError:
        _asyncio_available = False

    @pytest.mark.asyncio
    async def test_raw_results_async_success(self, wrapper, mock_response_data):
        # Skip only if pytest-asyncio is not installed
        if not self._asyncio_available:
            pytest.skip("pytest-asyncio is not installed")

        with patch('json.loads', return_value=mock_response_data):
            original_method = InfoQuestAPIWrapper.raw_results_async

            async def mock_raw_results_async(self, query, time_range=0, site="", output_format="json"):
                return mock_response_data["search_result"]

            InfoQuestAPIWrapper.raw_results_async = mock_raw_results_async

            try:
                result = await wrapper.raw_results_async("test query", time_range=0, site="")
                assert result == mock_response_data["search_result"]
            finally:
                InfoQuestAPIWrapper.raw_results_async = original_method

    @pytest.mark.asyncio
    async def test_raw_results_async_error(self, wrapper):
        if not self._asyncio_available:
            pytest.skip("pytest-asyncio is not installed")

        original_method = InfoQuestAPIWrapper.raw_results_async

        async def mock_raw_results_async_error(self, query, time_range=0, site="", output_format="json"):
            raise Exception("Error 400: Bad Request")

        InfoQuestAPIWrapper.raw_results_async = mock_raw_results_async_error

        try:
            with pytest.raises(Exception, match="Error 400: Bad Request"):
                await wrapper.raw_results_async("test query", time_range=0, site="")
        finally:
            InfoQuestAPIWrapper.raw_results_async = original_method

    def test_clean_results_with_images(self, wrapper, mock_response_data):
        # Test result cleaning functionality
        raw_results = mock_response_data["search_result"]["results"]
        cleaned_results = wrapper.clean_results_with_images(raw_results)

        assert len(cleaned_results) == 3

        # Test page result
        page_result = cleaned_results[0]
        assert page_result["type"] == "page"
        assert page_result["title"] == "Test Title"
        assert page_result["url"] == "https://example.com"
        assert page_result["desc"] == "Test description"
|
||||
|
||||
# Test news result
|
||||
news_result = cleaned_results[1]
|
||||
assert news_result["type"] == "news"
|
||||
assert news_result["time_frame"] == "2 days ago"
|
||||
assert news_result["title"] == "Test News"
|
||||
assert news_result["url"] == "https://example.com/news"
|
||||
assert news_result["source"] == "Test Source"
|
||||
|
||||
# Test image result
|
||||
image_result = cleaned_results[2]
|
||||
assert image_result["type"] == "image_url"
|
||||
assert image_result["image_url"] == "https://example.com/image.jpg"
|
||||
assert image_result["image_description"] == "Test image description"
|
||||
|
||||
def test_clean_results_empty_categories(self, wrapper):
|
||||
# Test result cleaning with empty categories
|
||||
data = [
|
||||
{
|
||||
"content": {
|
||||
"results": {
|
||||
"organic": [],
|
||||
"top_stories": {"items": []},
|
||||
"images": {"items": []}
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
|
||||
result = wrapper.clean_results_with_images(data)
|
||||
assert len(result) == 0
|
||||
|
||||
def test_clean_results_url_deduplication(self, wrapper):
|
||||
# Test URL deduplication functionality
|
||||
data = [
|
||||
{
|
||||
"content": {
|
||||
"results": {
|
||||
"organic": [
|
||||
{
|
||||
"title": "Test Title 1",
|
||||
"url": "https://example.com",
|
||||
"desc": "Description 1"
|
||||
},
|
||||
{
|
||||
"title": "Test Title 2",
|
||||
"url": "https://example.com",
|
||||
"desc": "Description 2"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
|
||||
result = wrapper.clean_results_with_images(data)
|
||||
assert len(result) == 1
|
||||
assert result[0]["title"] == "Test Title 1"
|
||||
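The first-wins URL deduplication that the test above asserts can be sketched as a standalone filter. This is an illustrative stand-in (the helper name `dedupe_by_url` is hypothetical), not the actual `clean_results_with_images` implementation:

```python
# Hypothetical sketch of the first-wins URL deduplication the test asserts;
# not the real clean_results_with_images implementation.
def dedupe_by_url(entries):
    seen = set()
    deduped = []
    for entry in entries:
        url = entry.get("url")
        if url in seen:
            continue  # later entries whose URL was already kept are dropped
        seen.add(url)
        deduped.append(entry)
    return deduped


results = dedupe_by_url([
    {"title": "Test Title 1", "url": "https://example.com"},
    {"title": "Test Title 2", "url": "https://example.com"},
])
assert [r["title"] for r in results] == ["Test Title 1"]
```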
226  tests/unit/tools/test_infoquest_search_results.py  Normal file
@@ -0,0 +1,226 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT

import json
from unittest.mock import Mock, patch

import pytest


class TestInfoQuestSearchResults:
    @pytest.fixture
    def search_tool(self):
        """Create a mock InfoQuestSearchResults instance."""
        mock_tool = Mock()

        mock_tool.time_range = 30
        mock_tool.site = "example.com"

        def mock_run(query, **kwargs):
            sample_cleaned_results = [
                {
                    "type": "page",
                    "title": "Test Title",
                    "url": "https://example.com",
                    "desc": "Test description",
                }
            ]
            sample_raw_results = {
                "results": [
                    {
                        "content": {
                            "results": {
                                "organic": [
                                    {
                                        "title": "Test Title",
                                        "url": "https://example.com",
                                        "desc": "Test description",
                                    }
                                ]
                            }
                        }
                    }
                ]
            }
            return json.dumps(sample_cleaned_results, ensure_ascii=False), sample_raw_results

        async def mock_arun(query, **kwargs):
            return mock_run(query, **kwargs)

        mock_tool._run = mock_run
        mock_tool._arun = mock_arun

        return mock_tool

    @pytest.fixture
    def sample_raw_results(self):
        """Sample raw results from the InfoQuest API."""
        return {
            "results": [
                {
                    "content": {
                        "results": {
                            "organic": [
                                {
                                    "title": "Test Title",
                                    "url": "https://example.com",
                                    "desc": "Test description",
                                }
                            ]
                        }
                    }
                }
            ]
        }

    @pytest.fixture
    def sample_cleaned_results(self):
        """Sample cleaned results."""
        return [
            {
                "type": "page",
                "title": "Test Title",
                "url": "https://example.com",
                "desc": "Test description",
            }
        ]

    def test_init_default_values(self):
        """Test initialization with default values using patch."""
        with patch("src.tools.infoquest_search.infoquest_search_results.InfoQuestAPIWrapper") as mock_wrapper_class:
            mock_instance = Mock()
            mock_wrapper_class.return_value = mock_instance

            from src.tools.infoquest_search.infoquest_search_results import InfoQuestSearchResults

            with patch.object(InfoQuestSearchResults, "__init__", return_value=None) as mock_init:
                InfoQuestSearchResults(infoquest_api_key="dummy-key")

                mock_init.assert_called_once()

    def test_init_custom_values(self):
        """Test initialization with custom values using patch."""
        with patch("src.tools.infoquest_search.infoquest_search_results.InfoQuestAPIWrapper") as mock_wrapper_class:
            mock_instance = Mock()
            mock_wrapper_class.return_value = mock_instance

            from src.tools.infoquest_search.infoquest_search_results import InfoQuestSearchResults

            with patch.object(InfoQuestSearchResults, "__init__", return_value=None) as mock_init:
                InfoQuestSearchResults(
                    time_range=10,
                    site="test.com",
                    infoquest_api_key="dummy-key",
                )

                mock_init.assert_called_once()

    def test_run_success(
        self,
        search_tool,
        sample_raw_results,
        sample_cleaned_results,
    ):
        """Test a successful synchronous run."""
        result, raw = search_tool._run("test query")

        assert isinstance(result, str)
        assert isinstance(raw, dict)
        assert "results" in raw

        result_data = json.loads(result)
        assert isinstance(result_data, list)
        assert len(result_data) > 0

    def test_run_exception(self, search_tool):
        """Test a synchronous run that reports an error."""
        original_run = search_tool._run

        def mock_run_with_error(query, **kwargs):
            return json.dumps({"error": "API Error"}, ensure_ascii=False), {}

        try:
            search_tool._run = mock_run_with_error
            result, raw = search_tool._run("test query")

            result_dict = json.loads(result)
            assert "error" in result_dict
            assert "API Error" in result_dict["error"]
            assert raw == {}
        finally:
            search_tool._run = original_run

    @pytest.mark.asyncio
    async def test_arun_success(
        self,
        search_tool,
        sample_raw_results,
        sample_cleaned_results,
    ):
        """Test a successful asynchronous run."""
        result, raw = await search_tool._arun("test query")

        assert isinstance(result, str)
        assert isinstance(raw, dict)
        assert "results" in raw

    @pytest.mark.asyncio
    async def test_arun_exception(self, search_tool):
        """Test an asynchronous run that reports an error."""
        original_arun = search_tool._arun

        async def mock_arun_with_error(query, **kwargs):
            return json.dumps({"error": "Async API Error"}, ensure_ascii=False), {}

        try:
            search_tool._arun = mock_arun_with_error
            result, raw = await search_tool._arun("test query")

            result_dict = json.loads(result)
            assert "error" in result_dict
            assert "Async API Error" in result_dict["error"]
            assert raw == {}
        finally:
            search_tool._arun = original_arun

    def test_run_with_run_manager(
        self,
        search_tool,
        sample_raw_results,
        sample_cleaned_results,
    ):
        """Test run with a callback manager."""
        mock_run_manager = Mock()
        result, raw = search_tool._run("test query", run_manager=mock_run_manager)

        assert isinstance(result, str)
        assert isinstance(raw, dict)

    @pytest.mark.asyncio
    async def test_arun_with_run_manager(
        self,
        search_tool,
        sample_raw_results,
        sample_cleaned_results,
    ):
        """Test async run with a callback manager."""
        mock_run_manager = Mock()
        result, raw = await search_tool._arun("test query", run_manager=mock_run_manager)

        assert isinstance(result, str)
        assert isinstance(raw, dict)

    def test_api_wrapper_initialization_with_key(self):
        """Test API wrapper initialization with an explicit key."""
        with patch("src.tools.infoquest_search.infoquest_search_results.InfoQuestAPIWrapper") as mock_wrapper_class:
            mock_instance = Mock()
            mock_wrapper_class.return_value = mock_instance

            from src.tools.infoquest_search.infoquest_search_results import InfoQuestSearchResults

            with patch.object(InfoQuestSearchResults, "__init__", return_value=None) as mock_init:
                InfoQuestSearchResults(infoquest_api_key="test-key")

                mock_init.assert_called_once()
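The tests above repeatedly assert that `_run` / `_arun` return a two-element tuple: a JSON string of cleaned results plus the raw payload dict (a content-and-artifact shape). A minimal self-contained sketch of that contract, using a hypothetical `fake_search` stand-in rather than the real tool:

```python
import json

# fake_search is a hypothetical stand-in for InfoQuestSearchResults._run:
# element 0 is a JSON string of cleaned results (LLM-friendly content),
# element 1 is the raw API payload kept as an artifact for later use.
def fake_search(query):
    raw = {
        "results": [
            {"content": {"results": {"organic": [
                {"title": "Test Title", "url": "https://example.com", "desc": "Test description"}
            ]}}}
        ]
    }
    cleaned = [
        {"type": "page", "title": "Test Title",
         "url": "https://example.com", "desc": "Test description"}
    ]
    return json.dumps(cleaned, ensure_ascii=False), raw


content, artifact = fake_search("test query")
assert isinstance(content, str) and isinstance(artifact, dict)
```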