W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents

How scaling width (parallel tool calling) alongside depth improves accuracy, cuts cost, and reduces latency for deep research agents.

Xiaoqiang Lin*, Jun Hao Liew*, Silvio Savarese, Junnan Li

Code: GitHub → MCP-Universe → DeepResearch

* Equal contribution.

Major LLM providers now offer parallel tool calling, but its impact on deep research agents has remained largely unexplored. With parallel tool calling, deep research agents can scale width (issuing multiple tool calls in a single step), not just depth (sequential steps). We show that this improves performance on BrowseComp, HLE, and GAIA while reducing turns, API cost, and wall-clock time. A descending tool-call scheduler (explore early, exploit later) adds a further ~6% gain, pointing to future work on training deep research agents to manage the width–depth trade-off. Notably, without context management or other tricks, we obtain 62.2% accuracy with GPT-5-Medium on BrowseComp, surpassing the 54.9% originally reported for GPT-5-High.

Single vs parallel tool calling

(a) Single vs. parallel tool calling in a multi-step deep research agent trace. In parallel tool calling, the model performs a single reasoning step to issue multiple tool calls simultaneously; these calls are executed in parallel and their outputs are returned together into the agent's trace. (b) Top: Performance of different LLMs under parallel tool calling with varying # tool calls per step. Bottom: Average # turns required to complete the task with different # parallel tool calls.

1. Introduction

Deep research agents perform multi-step reasoning and information seeking to answer questions that would otherwise take hours of human work. At each step, they reason and then execute tools (e.g., search, scrape, code) to gather information from the web, and eventually produce a final answer or report.

Recent work has focused on scaling depth: more sequential steps of thinking and tool use (e.g., DeepSeek, MiroThinker, LongCat). At the same time, many state-of-the-art LLMs support parallel tool calling—making multiple tool calls in a single step—but this “scaling width” dimension has remained largely unexplored for deep research.

We introduce the Wide and Deep (W&D) research agent to study the joint scaling of depth and width. Unlike multi-agent orchestration (e.g., Kimi K2.5’s agent swarm) or parallel reasoning with separate paths (e.g., LongCat), our approach uses intrinsic parallel tool calling: one reasoning step issues multiple tool calls that are executed in parallel and then fed back together. This keeps coordination simple and fits into standard single-agent frameworks.

2. Parallel Tool Calling

In the usual sequential setup, at each step t the model outputs reasoning Rt and a single tool call At; the environment returns observation Ot, and the process repeats until the model gives a final answer.

With parallel tool calling, at step t the model emits m tool calls at once, {At(1), …, At(m)}. They are executed in parallel, yielding the observations {Ot(1), …, Ot(m)} together in one step (see Figure (a) above). The trajectory thus becomes R1, {A1(1), …, A1(m)}, {O1(1), …, O1(m)}, R2, … instead of the sequential R1, A1, O1, R2, A2, O2, …
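As a concrete illustration, below is a minimal sketch of one such parallel step, assuming an OpenAI-style chat-completions client whose assistant message may carry multiple tool_calls; the client, tool schemas, and run_tool dispatcher are placeholders rather than the actual W&D / MCP-Universe implementation.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def parallel_step(client, model, messages, tools, run_tool):
    """One agent step: a single reasoning pass may emit m tool calls,
    which are executed concurrently and appended back to the trace."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        tools=tools,  # e.g., search / scrape / Python-interpreter schemas
    )
    assistant_msg = response.choices[0].message
    messages.append(assistant_msg)

    tool_calls = assistant_msg.tool_calls or []
    if not tool_calls:
        # No tool requested: the model produced its final answer.
        return messages, assistant_msg.content

    # Execute the m tool calls in parallel instead of one per turn.
    with ThreadPoolExecutor(max_workers=len(tool_calls)) as pool:
        results = list(pool.map(
            lambda tc: run_tool(tc.function.name, json.loads(tc.function.arguments)),
            tool_calls,
        ))

    # Feed all observations Ot(1), …, Ot(m) back together in the same step.
    for tc, result in zip(tool_calls, results):
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": str(result)})
    return messages, None  # not done yet; the caller continues the loop
```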

We control the number of tool calls per step via a per-step user message (e.g., “make at least m but not more than m+1 function calls in a single response”). This gives precise, reproducible control for experiments.
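For instance, that control message can be injected before every model call; the wording below mirrors the instruction quoted above, while the helper itself is just a sketch.

```python
def tool_budget_message(m: int) -> dict:
    """Per-step user message pinning the width of the next step:
    the model should issue m (or at most m + 1) tool calls."""
    return {
        "role": "user",
        "content": (
            f"Make at least {m} but not more than {m + 1} "
            "function calls in a single response."
        ),
    }

# Before each step, e.g. for a width of 3:
# messages.append(tool_budget_message(3))
```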

3. Experimental Results

We use three deep-research benchmarks: BrowseComp, Humanity's Last Exam (HLE), and GAIA. Models are GPT-5 (Medium), Gemini 3.0 Pro, and Claude 4.5 Sonnet. The tool environment is from MCP-Universe: a search tool and a scraping tool with LLM summarization for all benchmarks, plus an additional Python interpreter tool for HLE.
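For reference, such a tool environment can be exposed to the model through function-calling schemas along the following lines; the names and parameters here are illustrative stand-ins, not the actual MCP-Universe tool definitions.

```python
# Illustrative tool schemas (hypothetical names and fields, not MCP-Universe's own).
SEARCH_AND_SCRAPE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web and return a ranked list of results.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "scrape_url",
            "description": "Fetch a page and return an LLM-generated summary of its content.",
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
]
# For HLE, a Python-interpreter tool is added in the same format.
```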

Performance and efficiency

Higher accuracy with fewer iterations. Parallel tool calling improves both accuracy and efficiency: it achieves higher accuracy while reducing the number of turns needed to complete the task (see the left figure below).

With a lower max-turn limit, more tools per step seems to always be better (see the right figure below). With a higher max-turn limit, a moderate number of tools per step (e.g., 3) is often best, suggesting that varying the number of tool calls across steps can help; this motivates our tool-call scheduler experiments in Section 5.

BrowseComp accuracy vs turns and tools per turn

(Left) BrowseComp accuracy against average number of turns. (Right) Accuracy against number of tools per turn.

The table below summarizes accuracy and average number of iterations to completion for different tool-call limits per step on BrowseComp.

Accuracy and average number of iterations to completion (in brackets) on the BrowseComp dataset across different tool call limits per iteration. No tool call implies the LLM answers solely via reasoning without tools. n iters indicates the agent is forced to answer and stop at the n-th iteration.
|         | No tool call | 10 iters  | 25 iters  | 50 iters  | 100 iters | 150 iters | 300 iters |
|---------|--------------|-----------|-----------|-----------|-----------|-----------|-----------|
| 1 Tool  | –            | 44 (9.7)  | 55 (20.3) | 60 (31.9) | 66 (45.7) | 70 (49.8) | 67 (54.7) |
| 2 Tools | –            | 45 (9.6)  | 62 (17.4) | 63 (26.5) | 68 (27.1) | 66 (26.2) | –         |
| 3 Tools | 8 (1.0)      | 48 (9.2)  | 68 (16.0) | 66 (21.0) | 68 (23.8) | 73 (21.0) | –         |
| 5 Tools | –            | 53 (9.1)  | 65 (15.6) | 64 (18.4) | 60 (19.9) | –         | –         |
| 8 Tools | –            | 56 (10.0) | 67 (13.4) | 65 (15.7) | 63 (15.2) | –         | –         |

Example cost and time (100 tasks from BrowseComp): Single tool calling at 66% accuracy → ~$102.5 and ~1523 s. Parallel (3 tools/turn) at 68% accuracy → ~$65.7 (~36% cost reduction) and ~904 s (~41% time reduction).

Key numbers: ~36% cost reduction and ~41% wall-clock time reduction (parallel vs. single tool calling at similar accuracy), and a best accuracy of 74% with the Descending scheduler on BrowseComp (see Section 5).

Generalization across models and benchmarks. These gains hold across different LLMs (GPT-5, Gemini, Claude) and across benchmarks (BrowseComp, HLE, GAIA). Scaling width consistently improves accuracy and reduces average turns in all settings.

Scaling across benchmarks and models

Scaling of tool calls. (Top row) Performance of GPT-5-Medium across different benchmarks. (Bottom row) Performance of different models on the BrowseComp benchmark.

To validate whether open-source models also benefit, we evaluated DeepSeek-V3.2 and Qwen3-235B-A22B-Thinking-2507 on BrowseComp. The table below shows a small gain from parallel tool calling, but it is less pronounced than for state-of-the-art proprietary models, suggesting room for improvement in how open-source models are trained for parallel tool calling.

Accuracy and average number of iterations to completion (in brackets) on BrowseComp for open-source models: single vs parallel (3 tools) tool calling.
|                               | Single tool calling | Parallel tool calling (3 tools) |
|-------------------------------|---------------------|---------------------------------|
| Qwen3-235B-A22B-Thinking-2507 | 8 (18.5)            | 11 (52.1)                       |
| DeepSeek-V3.2                 | 38 (78.0)           | 39 (52.5)                       |

4. Why Parallel Tool Calling Improves Accuracy

We inspected many agent traces and identified three main drivers.

Observation 1: Broader exploration improves source credibility. Parallel calls trigger multiple queries and aggregate diverse sources, so the model can compare them and choose the most authoritative (e.g., an official UN report over a third-party API). In one HLE-style question on UN parliamentary data, parallel calling led to the correct UN Statistical Yearbook value (24.3%); the single-call agent relied on an API that returned 22.25%, which was wrong for the intended metric.

Observation 2: Redundancy enables verification and guards against unreliable tools. With a single call, the agent may trust a faulty or unreliable tool output. With parallel calls, the model often requests similar information with different arguments; inconsistent results signal unreliability and trigger further tool calls for verification. In a tuition query, the single-call agent accepted a hallucinated scrape result from the LLM summarization; the parallel-call agent detected inconsistency and did further scraping to get the correct data.

Observation 3: Query decomposition improves retrieval. Complex, multi-faceted questions are hard for one stuffed query. Parallel tool calling lets the model decompose the request into several simpler queries (e.g., “1990 World Cup referee list” and “1994 World Cup referee list” instead of one long keyword string "1990 ... 1994 World Cup referee list"), improving recall and success.
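As a toy illustration of this decomposition, a single parallel response might carry two simpler search calls instead of one stuffed query (the tool name follows the illustrative schema from Section 3).

```python
# One stuffed query (sequential style):
stuffed_call = {"name": "web_search",
                "arguments": {"query": "1990 ... 1994 World Cup referee list"}}

# Decomposed calls emitted together in a single parallel reasoning step:
decomposed_calls = [
    {"name": "web_search", "arguments": {"query": "1990 World Cup referee list"}},
    {"name": "web_search", "arguments": {"query": "1994 World Cup referee list"}},
]
```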

5. Tool Call Scheduler

Motivated by the previous results, we varied the number of tool calls per step over time using simple schedulers to further improve performance: Ascending (few calls early, more later), Descending (many calls early, fewer later), and Automatic (the LLM decides how many calls to make at each step), compared against constant baselines.

Accuracy and average number of iterations to completion (in brackets) on the BrowseComp dataset across different tool call schedulers.
|                   | Constant 1 Tool | Constant 3 Tools | Ascending | Descending | Automatic |
|-------------------|-----------------|------------------|-----------|------------|-----------|
| Accuracy (#Turns) | 66 (45.7)       | 68 (23.8)        | 63 (36.5) | 74 (23.5)  | 72 (26.6) |

Descending gives about a 6% gain over Constant 3 Tools and outperforms Ascending and Automatic, suggesting that an explore-then-exploit schedule is helpful and that current LLMs do not yet optimize this trade-off on their own. Future work could train agents (e.g., via RL) to manage the width–depth trade-off dynamically.
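A descending scheduler can be as simple as a function from the turn index to the per-step width m, fed into the per-step control message from Section 2; the concrete schedule below (start wide, decay to a single call) is an illustrative choice of ours, not necessarily the exact schedule used in the experiments.

```python
def descending_budget(turn: int, start: int = 8, end: int = 1, decay_turns: int = 20) -> int:
    """Illustrative explore-then-exploit schedule: many parallel tool calls
    in early turns, tapering to a single call after `decay_turns` turns."""
    if turn >= decay_turns:
        return end
    frac = turn / decay_turns
    return max(end, round(start - frac * (start - end)))

# Usage with the per-step control message (see Section 2):
# for turn in range(max_turns):
#     messages.append(tool_budget_message(descending_budget(turn)))
#     ...
```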

Average tool calls per turn for each scheduler

The average # tool calls across all turns for the different tool-call schedulers.

6. Conclusion

We introduced the Wide and Deep research agent and showed that scaling width via parallel tool calling improves performance on BrowseComp, HLE, and GAIA while reducing the number of turns, end-to-end latency, and API cost. Gains come from better source verification, redundancy against tool failures, and query decomposition. A Descending tool-call scheduler (more parallel calls early, fewer later) adds a substantial accuracy boost; having the LLM choose the number of calls (Automatic) does not match it yet. We see optimizing the width–depth trade-off, possibly with learned schedulers, as a promising future direction for next-generation, high-efficiency deep research agents.