# Deep Research System - Development Documentation

**Framework:** DeepAgents (LangChain) | **Last updated:** 2025-10-31
---

## 📖 About This Document

This document focuses on **technical implementation details**.

**Related documents**:

- [需求文档_V1.md](./需求文档_V1.md) - Product requirements and business logic
- [开发流程指南.md](./开发流程指南.md) - Development priorities, workflow, and code review process
- [.claude/agents/code-reviewer.md](./.claude/agents/code-reviewer.md) - Code review standards

---
## System Architecture

### Agent Structure (1 coordinator + 6 subagents)

```
ResearchCoordinator (main agent)
├── intent-analyzer (intent analysis → search_queries.json)
├── search-orchestrator (parallel search → search_results.json)
├── source-validator (source validation → sources.json)
├── content-analyzer (content analysis → findings.json)
├── confidence-evaluator (confidence evaluation → confidence.json + iteration_decision.json)
└── report-generator (report generation → final_report.md)
```
### Execution Flow

```
User input → ResearchCoordinator

[Step 1] Call intent-analyzer → /search_queries.json

[Iteration loop] (round N)
  [Step 2] Call search-orchestrator → /iteration_N/search_results.json
  [Step 3] Call source-validator → /iteration_N/sources.json
  [Step 4] Call content-analyzer → /iteration_N/findings.json
  [Step 5] Call confidence-evaluator → /iteration_N/confidence.json
                                       /iteration_decision.json
  [Step 6] Main agent reads iteration_decision.json
    ├─ CONTINUE → generate follow-up queries → back to Step 2
    └─ FINISH → proceed to Step 7

[Step 7] Call report-generator → /final_report.md
```
**Key points:**

- ✅ The main agent is guided by its **system prompt**, not a Python while loop
- ✅ State is determined by **reading files**, not by function return values
- ✅ SubAgents share data through the **virtual file system**

---
## Tech Stack

### Environment Setup

**Virtual environment:** `deep_research_env` (Python 3.11.x, Anaconda)

#### Create the virtual environment (if not created yet)

```bash
# Create the virtual environment
conda create -n deep_research_env python=3.11 -y

# Activate it
conda activate deep_research_env
```
#### Install dependencies

**requirements.txt:**
```
# Core framework
deepagents>=0.1.0
langchain>=0.3.0
langchain-openai>=0.2.0
langchain-community>=0.3.0
langgraph>=0.2.0

# Search tool
tavily-python>=0.5.0

# Environment variable management
python-dotenv>=1.0.0

# CLI and progress display
rich>=13.0.0
click>=8.1.0

# Utilities
typing-extensions>=4.12.0
pydantic>=2.0.0
```
**Installation:**
```bash
# Make sure the virtual environment is active
conda activate deep_research_env

# Install dependencies
pip install -r requirements.txt

# Verify the installation
python -c "import deepagents; print('DeepAgents installed successfully')"
```

---
### Core Framework

```python
from deepagents import create_deep_agent
from langchain_openai import ChatOpenAI

# DeepAgents automatically attaches three core middlewares:
# - TodoListMiddleware → write_todos tool
# - FilesystemMiddleware → ls, read_file, write_file, edit_file, glob, grep tools
# - SubAgentMiddleware → task tool
```
### API Configuration

**.env file:**
```bash
DASHSCOPE_API_KEY=your_dashscope_key_here
TAVILY_API_KEY=your_tavily_key_here
```
**src/config.py:**
```python
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

load_dotenv()

llm = ChatOpenAI(
    model="qwen-max",
    openai_api_key=os.environ.get("DASHSCOPE_API_KEY"),
    openai_api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    timeout=60,
    max_retries=2
)

TAVILY_API_KEY = os.environ.get("TAVILY_API_KEY")

ERROR_HANDLING_CONFIG = {
    "max_retries": 3,
    "retry_delay": 1.0,
    "backoff_factor": 2.0,
    "timeout": {"search": 30, "subagent": 120, "total": 600}
}
```
**Security:**

- ⚠️ Never commit `.env` to version control
- ✅ Add `.env` to `.gitignore`
- ✅ Provide a `.env.example` template

---
## Virtual File System

```
/
├── question.txt              # Original question
├── config.json               # Research configuration
├── search_queries.json       # Search query list
├── iteration_1/
│   ├── search_results.json   # Search results
│   ├── sources.json          # Validated sources (tier-ranked)
│   ├── findings.json         # Analysis findings
│   └── confidence.json       # Confidence evaluation
├── iteration_2/
│   └── ...
├── iteration_decision.json   # {"decision": "CONTINUE/FINISH", "reason": "..."}
└── final_report.md           # Final report
```
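DeepAgents' FilesystemMiddleware keeps these files in agent state rather than on disk; conceptually, the file system is a mapping from path to text. A minimal sketch of the CONTINUE/FINISH handoff through that mapping (the plain dict here is illustrative, not the actual middleware API):

```python
import json

# The virtual file system, conceptually: path -> file contents.
fs: dict[str, str] = {}

# confidence-evaluator writes its decision...
fs["/iteration_decision.json"] = json.dumps(
    {"decision": "CONTINUE", "current_iteration": 1,
     "reason": "confidence 0.61 < target 0.7"}
)

# ...and the coordinator reads it back to choose the next step.
decision = json.loads(fs["/iteration_decision.json"])
next_step = "search-orchestrator" if decision["decision"] == "CONTINUE" else "report-generator"
print(next_step)
```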
**config.json format:**
```json
{
  "depth_mode": "standard",
  "target_confidence": 0.7,
  "min_tier": 2,
  "max_iterations": 3,
  "parallel_searches": 5,
  "report_format": "technical"
}
```
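Since pydantic is already a dependency, the config can be validated before a run. A hypothetical `ResearchConfig` model (not part of the codebase, just a sketch of the field constraints implied above):

```python
from pydantic import BaseModel, Field

class ResearchConfig(BaseModel):
    depth_mode: str = "standard"
    target_confidence: float = Field(0.7, ge=0.0, le=1.0)
    min_tier: int = Field(2, ge=1, le=4)
    max_iterations: int = Field(3, ge=1)
    parallel_searches: int = Field(5, ge=1)
    report_format: str = "technical"

# Validate the example config from above; out-of-range values raise ValidationError.
config = ResearchConfig.model_validate({
    "depth_mode": "standard",
    "target_confidence": 0.7,
    "min_tier": 2,
    "max_iterations": 3,
    "parallel_searches": 5,
    "report_format": "technical",
})
print(config.max_iterations)
```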
---

## SubAgent Configuration

### Configuration Schema

```python
subagents = [
    {
        "name": "subagent-name",             # Required: kebab-case
        "description": "short description",  # Required
        "system_prompt": "detailed prompt",  # Required: the key is system_prompt, not prompt!
        "tools": [tool1, tool2],             # Optional: list of tool instances
        "model": "openai:gpt-4o"             # Optional
    }
]
```
### Configuration for the 6 SubAgents

```python
from deepagents import create_deep_agent
from src.tools.search_tools import create_batch_search_tool

batch_internet_search = create_batch_search_tool()

subagents = [
    {
        "name": "intent-analyzer",
        "description": "Analyze user intent and generate search queries",
        "system_prompt": """You are an intent analysis expert.

[Tasks]
1. Read /question.txt and /config.json
2. Identify the domain type (technical/academic/general)
3. Extract 3-8 core keywords
4. Generate queries according to parallel_searches

[Output] Write to /search_queries.json:
{
  "domain": "technical",
  "keywords": ["keyword1", "keyword2"],
  "queries": ["query1", "query2", "query3"]
}""",
        "tools": []
    },

    {
        "name": "search-orchestrator",
        "description": "Run parallel searches and aggregate deduplicated results",
        "system_prompt": """You are a search orchestration expert.

[Tasks]
1. Read /search_queries.json
2. Run batch searches with the batch_internet_search tool
3. Aggregate results and deduplicate by URL
4. Normalize the format

[Output] Write to /iteration_N/search_results.json:
[
  {
    "url": "https://...",
    "title": "...",
    "snippet": "...",
    "published_date": "YYYY-MM-DD",
    "source_type": "official_doc|blog|forum|paper"
  }
]""",
        "tools": [batch_internet_search]
    },

    {
        "name": "source-validator",
        "description": "Validate source credibility and assign tier rankings",
        "system_prompt": """You are a source validation expert.

[Tier criteria]
- Tier 1 (0.9-1.0): official documentation, authoritative journals, standards bodies
- Tier 2 (0.7-0.9): MDN, highly voted Stack Overflow answers, major-vendor blogs
- Tier 3 (0.5-0.7): high-quality tutorials, Wikipedia
- Tier 4 (0.3-0.5): forums, personal blogs

[Tasks]
1. Read /iteration_N/search_results.json
2. Assign each source a tier and a score
3. Compute quality statistics
4. Check the requirements (total >= 5, Tier 1-2 >= 3)

[Output] Write to /iteration_N/sources.json:
{
  "sources": [{"url": "...", "tier": 1, "tier_score": 0.95, ...}],
  "quality_check": {
    "total_count": 18,
    "tier1_count": 5,
    "tier2_count": 8,
    "meets_requirement": true
  }
}""",
        "tools": []
    },

    {
        "name": "content-analyzer",
        "description": "Extract content, cross-validate, and detect contradictions",
        "system_prompt": """You are a content analysis expert.

[Tasks]
1. Read /iteration_N/sources.json
2. Extract key information from each source
3. Group findings by topic
4. Cross-validate: multiple sources supporting the same conclusion
5. Detect contradictions: sources conflicting on the same fact
6. Identify knowledge gaps

[Output] Write to /iteration_N/findings.json:
{
  "findings": [
    {
      "topic": "topic 1",
      "statement": "key finding",
      "supporting_sources": ["url1", "url2"],
      "contradicting_sources": [],
      "evidence": ["evidence 1", "evidence 2"]
    }
  ],
  "contradictions": [...],
  "knowledge_gaps": ["missing info 1", "missing info 2"]
}""",
        "tools": [batch_internet_search]
    },

    {
        "name": "confidence-evaluator",
        "description": "Compute confidence and decide whether to keep iterating",
        "system_prompt": """You are a confidence evaluation expert.

[Confidence formula]
confidence = (source credibility x 50%) + (cross-validation x 30%) + (recency x 20%)

[Scoring details]
- Source credibility: Tier1=0.95, Tier2=0.80, Tier3=0.65, Tier4=0.45 (averaged)
- Cross-validation: 1 source=0.4, 2-3 sources=0.7, 4+ sources=1.0, contradictions -0.3
- Recency: <6 months=1.0, 6-12 months=0.9, 1-2 years=0.7, 2-3 years=0.5, >3 years=0.3

[Tasks]
1. Read /iteration_N/sources.json and /iteration_N/findings.json
2. Compute a confidence score for each finding
3. Compute the overall average confidence
4. Read target_confidence and max_iterations from /config.json
5. Decide whether to keep iterating

[Decision logic]
- overall_confidence >= target → FINISH
- below target and current_iteration < max → CONTINUE
- max reached → FINISH (flag as below target)

[Output]
1. Write to /iteration_N/confidence.json:
   {"findings_confidence": [...], "overall_confidence": 0.78}
2. Write to /iteration_decision.json:
   {"decision": "CONTINUE", "current_iteration": 1, "reason": "..."}""",
        "tools": []
    },

    {
        "name": "report-generator",
        "description": "Generate a technical or academic research report",
        "system_prompt": """You are a report generation expert.

[Tasks]
1. Read the data from all iterations:
   - /question.txt
   - /config.json
   - /iteration_*/findings.json
   - /iteration_*/sources.json
   - /iteration_*/confidence.json
2. Pick the report structure (technical/academic) based on report_format
3. Generate the full report

[Technical report structure]
# Technical Research Report: {topic}
## 📊 Research Metadata
## 🎯 Executive Summary
## 🔍 Key Findings
## 📊 Source Credibility Matrix
## ⚠️ Contradictions and Uncertainties
## 📚 References

[Output] Write to /final_report.md""",
        "tools": []
    }
]
```
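The confidence formula in the confidence-evaluator prompt is applied by the LLM, not by code, but it can be sanity-checked as plain arithmetic. A minimal sketch (function and variable names are illustrative, not part of the codebase):

```python
TIER_SCORES = {1: 0.95, 2: 0.80, 3: 0.65, 4: 0.45}

def cross_validation_score(n_sources: int, has_contradiction: bool) -> float:
    # 1 source=0.4, 2-3 sources=0.7, 4+ sources=1.0; contradictions cost 0.3
    base = 0.4 if n_sources <= 1 else 0.7 if n_sources <= 3 else 1.0
    return max(0.0, base - (0.3 if has_contradiction else 0.0))

def recency_score(age_months: float) -> float:
    if age_months < 6: return 1.0
    if age_months < 12: return 0.9
    if age_months < 24: return 0.7
    if age_months < 36: return 0.5
    return 0.3

def finding_confidence(tiers, n_supporting, has_contradiction, age_months):
    credibility = sum(TIER_SCORES[t] for t in tiers) / len(tiers)
    return (credibility * 0.5
            + cross_validation_score(n_supporting, has_contradiction) * 0.3
            + recency_score(age_months) * 0.2)

# Two Tier-1 sources and one Tier-2, 3 supporting sources, no contradiction,
# 4 months old: 0.90 * 0.5 + 0.7 * 0.3 + 1.0 * 0.2 = 0.86
print(round(finding_confidence([1, 1, 2], 3, False, 4), 2))
```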
### Creating the Main Agent

```python
from src.config import llm

coordinator = create_deep_agent(
    model=llm,
    system_prompt=COORDINATOR_SYSTEM_PROMPT,  # see the next section
    tools=[],
    subagents=subagents
)
```
---

## Main Agent System Prompt (Core)

```python
COORDINATOR_SYSTEM_PROMPT = """
You are a deep-research coordination expert. You complete complex research tasks by calling SubAgents and managing the virtual file system.

# Core Principles
- Call SubAgents through the task tool
- Read SubAgent output through read_file
- Track task progress through write_todos
- Decide the next step autonomously from file contents (this is not a Python loop)

# Execution Flow

## Initialization
1. Read /question.txt and /config.json
2. Create the task list: write_todos([{"task": "intent analysis", "status": "pending"}, ...])

## Step 1: Intent Analysis
1. Update progress: write_todos([{"task": "intent analysis", "status": "in_progress"}, ...])
2. Call: task(name="intent-analyzer")
3. Read: read_file("/search_queries.json")
4. Mark done: write_todos([{"task": "intent analysis", "status": "completed"}, ...])

## Steps 2-6: Research Iteration (at most max_iterations rounds)

**Run the SubAgents in order:**
1. search-orchestrator → /iteration_N/search_results.json
2. source-validator → /iteration_N/sources.json
3. content-analyzer → /iteration_N/findings.json
4. confidence-evaluator → /iteration_N/confidence.json + /iteration_decision.json

**Iteration decision:**
Read /iteration_decision.json:
- decision="FINISH" → go to Step 7
- decision="CONTINUE" and current_iteration < max → generate follow-up queries, return to step 2.1
- max_iterations reached → go to Step 7

## Step 7: Generate the Report
1. Update progress
2. Call: task(name="report-generator")
3. Read: /final_report.md
4. Return the report path to the user

# Error Handling

## SubAgent call failures
- Timeout → reduce parallelism, retry once
- API rate limit → wait 30 seconds, retry once
- Anything else → log the error and continue (degraded run)

## Insufficient search quality
- meets_requirement: false → generate broader queries and search again (at most 2 expansions)

## Confidence target unreachable
- Still below target at max iterations → stop anyway and flag the shortfall in the report

## Partial-failure tolerance
- 2 of 5 queries fail → continue with the 3 that succeeded
- Record failure statistics in the report metadata

# Progress Monitoring
Also maintain /progress.json:
{
  "current_step": "search-orchestrator",
  "iteration": 2,
  "total_iterations": 3,
  "estimated_completion": "60%",
  "eta_seconds": 180
}

# Important Reminders
1. **No Python while loops** - LangGraph keeps calling you
2. **Judge state from files** - not from return values
3. **Decide each step yourself** - you make the call
4. **Failure is not fatal** - degrade gracefully and always produce a report
"""
```
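The iteration-decision rules above can be stated as a pure function for clarity (an illustrative sketch; in the real system the confidence-evaluator LLM makes this call and writes /iteration_decision.json):

```python
def decide_iteration(overall_confidence: float, target: float,
                     current_iteration: int, max_iterations: int) -> dict:
    """Mirror the CONTINUE/FINISH rules from the coordinator prompt."""
    if overall_confidence >= target:
        return {"decision": "FINISH", "reason": "confidence target reached"}
    if current_iteration < max_iterations:
        return {"decision": "CONTINUE", "current_iteration": current_iteration,
                "reason": "below target, iterations remaining"}
    return {"decision": "FINISH", "reason": "max iterations reached, target not met"}

print(decide_iteration(0.61, 0.7, 1, 3)["decision"])  # CONTINUE
print(decide_iteration(0.78, 0.7, 2, 3)["decision"])  # FINISH
print(decide_iteration(0.61, 0.7, 3, 3)["decision"])  # FINISH (below target)
```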
---

## Custom Tool: Batch Parallel Search

```python
# src/tools/search_tools.py

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List

from langchain_community.tools.tavily_search import TavilySearchResults
from langchain.tools import tool


def create_batch_search_tool():
    # TavilySearchResults reads TAVILY_API_KEY from the environment
    # (loaded by load_dotenv() in src/config.py).
    tavily = TavilySearchResults(
        max_results=10,
        search_depth="advanced",
        include_raw_content=False
    )

    @tool
    def batch_internet_search(queries: List[str]) -> List[Dict]:
        """
        Run multiple search queries in parallel and aggregate deduplicated results.

        Args:
            queries: list of search queries

        Returns:
            Aggregated search results (deduplicated, sorted by relevance)
        """
        def search_single(query: str) -> List[Dict]:
            try:
                results = tavily.invoke(query)
                for r in results:
                    r['query'] = query
                return results
            except Exception as e:
                print(f"Search failed for '{query}': {e}")
                return []

        all_results = []
        with ThreadPoolExecutor(max_workers=5) as executor:
            future_to_query = {executor.submit(search_single, q): q for q in queries}

            for future in as_completed(future_to_query):
                query = future_to_query[future]
                try:
                    results = future.result(timeout=30)
                    all_results.extend(results)
                except Exception as e:
                    print(f"Query timed out/failed for '{query}': {e}")

        # Deduplicate by URL, keeping the higher-relevance result
        seen_urls = {}
        for result in all_results:
            url = result.get('url')
            score = result.get('score', 0)
            if url not in seen_urls or seen_urls[url].get('score', 0) < score:
                seen_urls[url] = result

        # Sort by relevance score
        unique_results = sorted(
            seen_urls.values(),
            key=lambda x: x.get('score', 0),
            reverse=True
        )

        return unique_results

    return batch_internet_search


# Create the tool instance
batch_internet_search = create_batch_search_tool()
```
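The dedup-and-sort step keeps the highest-scoring entry per URL. A standalone demo of that same logic on sample data:

```python
sample = [
    {"url": "https://example.com/a", "score": 0.4},
    {"url": "https://example.com/a", "score": 0.9},  # duplicate URL, higher score wins
    {"url": "https://example.com/b", "score": 0.7},
]

# Keep the best-scoring result for each URL
seen = {}
for r in sample:
    if r["url"] not in seen or seen[r["url"]].get("score", 0) < r.get("score", 0):
        seen[r["url"]] = r

# Then rank by relevance score, highest first
ranked = sorted(seen.values(), key=lambda x: x.get("score", 0), reverse=True)
print([(r["url"], r["score"]) for r in ranked])
```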
**Why no calculate_tier or calculate_confidence tools?**

- LLMs reason well; stating the criteria in the system_prompt is enough
- Tier judgments need contextual understanding (domain + content type + date), which suits the LLM better
- Avoiding over-tooling keeps the system flexible

---
## Project Structure

```
deep_research/
├── .env                      # Environment variables (not committed)
├── .env.example              # Environment variable template
├── .gitignore
├── requirements.txt
├── README.md
│
├── src/
│   ├── __init__.py
│   ├── config.py             # API configuration
│   ├── main.py               # CLI entry point
│   │
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── coordinator.py    # ResearchCoordinator main agent
│   │   └── subagents.py      # The 6 SubAgent configurations
│   │
│   ├── tools/
│   │   ├── __init__.py
│   │   └── search_tools.py   # batch_internet_search
│   │
│   └── cli/
│       ├── __init__.py
│       └── commands.py       # research, config, history, resume commands
│
├── tests/
│   ├── __init__.py
│   ├── test_subagents.py
│   ├── test_tools.py
│   └── test_integration.py
│
└── outputs/
    └── .gitkeep
```

---

## Error Handling Configuration

Already defined in config.py:

```python
ERROR_HANDLING_CONFIG = {
    "max_retries": 3,
    "retry_delay": 1.0,
    "backoff_factor": 2.0,
    "timeout": {
        "search": 30,
        "subagent": 120,
        "total": 600
    }
}
```
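A retry helper wired to these settings might look like this (an illustrative sketch, not code from the repository; the `sleep` parameter exists only to make the backoff testable):

```python
import time

ERROR_HANDLING_CONFIG = {"max_retries": 3, "retry_delay": 1.0, "backoff_factor": 2.0}

def with_retries(fn, *args, config=ERROR_HANDLING_CONFIG, sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponential backoff on failure."""
    delay = config["retry_delay"]
    for attempt in range(config["max_retries"] + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == config["max_retries"]:
                raise
            sleep(delay)                      # 1.0s, then 2.0s, then 4.0s
            delay *= config["backoff_factor"]

# Demo: a call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)
print(result)
```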
### Degradation Strategies

| Scenario | Degradation | Impact |
|----------|-------------|--------|
| Search API timeout | Reduce parallel query count | Slower runs |
| Too few high-quality sources | Lower the min_tier requirement | Lower confidence |
| Iteration timeout | End early and generate the report | Lower coverage |
| LLM rate limiting | Retry with exponential backoff | Higher latency |

---
## Progress Tracking

### Using TodoListMiddleware

```python
# At the start of the research run
write_todos([
    {"task": "intent analysis", "status": "pending"},
    {"task": "round 1 search", "status": "pending"},
    {"task": "round 1 source validation", "status": "pending"},
    {"task": "round 1 content analysis", "status": "pending"},
    {"task": "round 1 confidence evaluation", "status": "pending"},
    {"task": "generate report", "status": "pending"}
])

# Update after each completed step
write_todos([
    {"task": "intent analysis", "status": "completed"},
    {"task": "round 1 search", "status": "in_progress"},
    ...
])
```
### CLI Progress Display

Real-time progress with the Rich library:

```python
# src/cli/commands.py
import json
import time

from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn

def read_progress() -> dict:
    """Read the progress snapshot (assumes the run mirrors the virtual
    /progress.json to outputs/progress.json on disk)."""
    with open("outputs/progress.json") as f:
        return json.load(f)

def research_command(topic: str, **options):
    console = Console()

    with Progress(
        SpinnerColumn(),
        TextColumn("[bold blue]{task.description}"),
        BarColumn(),
        TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
    ) as progress:
        research_task = progress.add_task("[cyan]Researching...", total=100)

        # Poll /progress.json periodically to drive the progress bar
        while not progress.finished:
            progress_data = read_progress()
            # estimated_completion is a string like "60%"
            percent = float(progress_data['estimated_completion'].rstrip('%'))
            progress.update(
                research_task,
                completed=percent,
                description=f"[cyan]{progress_data['current_step']}"
            )
            time.sleep(1)
```
---

## 🎓 References

- **DeepAgents documentation**: https://github.com/langchain-ai/deepagents
- **DeepAgents blog post**: https://blog.langchain.com/deep-agents/
- **LangChain Agents documentation**: https://docs.langchain.com/oss/python/langchain/agents
- **Tavily Search API**: https://tavily.com/

---

**Document version:** 1.0 | **Last updated:** 2025-10-31