QuestionBench: Measuring How Well AI Agents Ask Questions

We built a benchmark to test whether AI models can strategically ask questions to gather information they don’t know. Claude Opus 4.5 with extended thinking achieved 78.30% task success.

TL;DR: We built a benchmark to test whether AI models can strategically ask questions to gather information they don’t know. Claude Opus 4.5 with extended thinking achieved 78.30% task success, followed by Claude with explicit memory tools at 77.20%; both outperformed every other agent and baseline.

Two surprising findings:

(1) An Oracle agent with direct access to all user information achieved only 68.40% success, suggesting that having information isn’t enough without reasoning about its relevance.

(2) GPT-5.2 with memory achieved 87.13% question appropriateness (vs Claude’s 55-60%) while asking only 2.1 questions on average instead of 5, suggesting fundamentally different strategic approaches across models.

The Problem: AI Models Aren’t Built to Ask Questions #

Most AI models are trained to answer questions, not ask them. When you interact with ChatGPT, Claude, or Gemini, they respond to your prompts with information they already have (or think they have). But what happens when an AI needs to complete a task that requires information it doesn’t possess?

Consider a simple scenario: recommend three movies for a user. A human would naturally ask “What genres do you like?” or “What’s the last movie you really enjoyed?” But AI models, trained primarily on question-answering, don’t have this behavior built in. They might just guess based on general popularity or make recommendations without gathering any user-specific information.

This limitation becomes critical as we move toward more agentic AI systems that need to operate autonomously, gather information proactively, and make decisions based on incomplete knowledge.

What We Tested: QuestionBench #

QuestionBench evaluates how well AI agents can acquire information through strategic questioning. We created 100 different scenarios across six domains (preferences, professional, health, social, temporal, and spatial) where agents must:

  1. Ask questions to learn about a simulated user (persona)
  2. Use that information to complete a task (like recommending movies, suggesting restaurants, or planning events)
  3. Stay within a question budget (typically 5 questions)
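
To make the setup concrete, here is a minimal sketch of how a scenario and its question loop might be represented. The field names and the agent/simulator interfaces are our own illustrative assumptions, not the benchmark’s actual code:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Field names are illustrative assumptions, not the benchmark's actual schema.
    task: str                  # e.g. "Recommend 3 movies the user would enjoy"
    domain: str                # preferences, professional, health, social, temporal, or spatial
    persona: dict[str, str]    # hidden ground-truth attributes of the simulated user
    question_budget: int = 5   # maximum number of questions the agent may ask

def run_scenario(agent, user_simulator, scenario: Scenario):
    """Question loop: the agent asks, the simulated user answers, then the task is attempted."""
    for _ in range(scenario.question_budget):
        question = agent.next_question(scenario.task)   # hypothetical agent interface
        if question is None:                            # agent decides it knows enough
            break
        answer = user_simulator.answer(question, scenario.persona)  # hypothetical simulator interface
        agent.observe(question, answer)
    return agent.complete_task(scenario.task)           # graded against the persona afterwards
```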

Example Scenario #

Task: Recommend 3 movies the user would enjoy
What the agent knows: Nothing initially
What it needs to learn: The user’s favorite movie genres and recent movies they enjoyed
Question budget: 5 questions
Success criteria: Do the recommended movies match the user’s actual preferences?

The catch is that agents had to figure out what to ask and when to stop asking. Some information is more valuable than others. And some questions might be inappropriate (too sensitive, too broad, or redundant).

The Agents We Tested #

We evaluated two categories of approaches: AI-powered agents and rule-based baseline heuristics.

AI-Powered Agents #

Claude Opus 4.5 with Extended Thinking: Anthropic’s most capable model with “extended thinking” mode enabled. This gives the model extra computational budget to reason about what questions to ask before committing to one. We had to configure it carefully (temperature must be 1.0 for extended thinking, and max output tokens must exceed the thinking budget).
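
For reference, here is a minimal sketch of that configuration using the Anthropic Python SDK. The model identifier string and the specific budget values are assumptions for illustration; check Anthropic’s documentation for current names and limits:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5",   # model ID is an assumption; check the docs for the exact string
    max_tokens=8000,           # must exceed the thinking budget below
    temperature=1.0,           # extended thinking requires temperature 1.0
    thinking={"type": "enabled", "budget_tokens": 5000},  # internal reasoning budget
    messages=[{
        "role": "user",
        "content": "You may ask up to 5 questions before recommending 3 movies. What is your first question?",
    }],
)

# The reply interleaves hidden "thinking" blocks with visible "text" blocks;
# only the text blocks are shown to the user.
for block in response.content:
    if block.type == "text":
        print(block.text)
```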

Claude Sonnet 4.5: The baseline Claude model without extended thinking or special tools.

Claude with Memory Tools: Claude Sonnet equipped with explicit memory management functions (record_fact, check_if_known, list_known_facts). This lets the model externalize its memory rather than relying solely on the conversation context.
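
A minimal sketch of what might sit behind those three functions. The storage format and return strings are our assumptions, not the benchmark’s implementation:

```python
class FactMemory:
    """Toy key-value store backing the record_fact / check_if_known / list_known_facts tools."""

    def __init__(self):
        self._facts: dict[str, str] = {}

    def record_fact(self, key: str, value: str) -> str:
        self._facts[key] = value
        return f"Recorded {key} = {value}"

    def check_if_known(self, key: str) -> str:
        if key in self._facts:
            return f"Known: {key} = {self._facts[key]}"
        return f"Unknown: {key}"

    def list_known_facts(self) -> str:
        if not self._facts:
            return "No facts recorded yet."
        return "\n".join(f"{k}: {v}" for k, v in self._facts.items())

# Example: the agent checks before asking, avoiding a redundant question.
memory = FactMemory()
memory.record_fact("favorite_genre", "science fiction")
print(memory.check_if_known("favorite_genre"))    # Known: favorite_genre = science fiction
print(memory.check_if_known("favorite_cuisine"))  # Unknown: favorite_cuisine
```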

GPT-5.2 with Memory: OpenAI’s GPT-5.2 model with an explicit memory management system to track what it has learned.

Baseline Heuristics #

To understand how well AI models perform, we compared them against several rule-based approaches:

Oracle: A “cheating” agent that has direct access to the user’s persona data. It represents the theoretical best case for efficiency (it can complete tasks with minimal questions because it already knows everything).

Uncertainty Sampling: Asks questions that target the user attributes it is most uncertain about, i.e. the ones it hasn’t learned yet. This is a classic active-learning approach from machine learning.

Popular Questions: Asks the most commonly useful questions across different scenarios (like “What are your interests?” or “What’s your budget?”).

Random: Randomly selects questions from a pool of possibilities. This is our sanity check baseline.

Task-Driven: Uses hand-crafted rules to ask questions based purely on what the task requires, without considering what it has already learned.

No Questions: Attempts tasks without asking any questions at all. This shows what happens when an agent tries to operate on zero information.
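
To make the uncertainty-sampling idea concrete, here is a simplified sketch (our own formulation, not the benchmark’s implementation): the agent asks about whichever unknown attribute still has the most possible values, i.e. the highest residual entropy.

```python
import math

def residual_entropy(possible_values: list[str]) -> float:
    """Entropy (in bits) of a uniform distribution over the still-possible values."""
    n = len(possible_values)
    return math.log2(n) if n > 1 else 0.0

def pick_next_question(attribute_space: dict[str, list[str]], known: dict[str, str]) -> str | None:
    """Ask about the unknown attribute whose answer would remove the most uncertainty."""
    unknown = {attr: values for attr, values in attribute_space.items() if attr not in known}
    if not unknown:
        return None
    target = max(unknown, key=lambda attr: residual_entropy(unknown[attr]))
    return f"What is your {target.replace('_', ' ')}?"

# Illustrative attribute space; the benchmark's actual attributes differ.
attribute_space = {
    "favorite_genre": ["sci-fi", "drama", "comedy", "horror", "action", "romance", "documentary", "thriller"],
    "preferred_era": ["pre-1980", "1980s-2000s", "post-2000"],
    "subtitles_ok": ["yes", "no"],
}
print(pick_next_question(attribute_space, known={}))  # asks about favorite_genre (3 bits > 1.58 > 1)
```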

Understanding the Metrics #

We measured performance across six dimensions:

Task Success (primary metric): Did the agent successfully complete the task? For movie recommendations, this means: did the recommended movies actually match the user’s genre preferences? This is measured as a percentage (0-100%).

Coverage: What percentage of the user’s relevant attributes did the agent discover? If a user has 10 relevant facts (like favorite genre, disliked actors, preferred era), and the agent learned 6 of them, coverage is 60%.

Information Gain: How many “bits” of information did the agent learn? This is a formal measure from information theory. Higher numbers mean the agent asked more informative questions. Think of it as: how much did you reduce your uncertainty about the user?

Efficiency: How much information did you gain per question asked? This is calculated as information gain divided by number of questions. High efficiency means you got a lot of value from each question.

Appropriateness: What percentage of questions were appropriate? Questions can be inappropriate if they are overly sensitive (asking about health conditions when planning a birthday party), too broad (“Tell me about yourself”), or redundant (asking the same thing twice).

Questions Asked: Average number of questions the agent used out of its budget (typically 5).
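
As a rough illustration of how coverage, information gain, and efficiency relate, here is a simplified formulation under our own assumptions (uniform priors over each attribute’s possible values; the benchmark’s exact normalization, including how efficiency is reported as a percentage, may differ):

```python
import math

def coverage(relevant: set[str], learned: set[str]) -> float:
    """Fraction of the user's relevant attributes the agent discovered."""
    return len(learned & relevant) / len(relevant)

def information_gain_bits(cardinalities: list[int]) -> float:
    """Bits of uncertainty removed, assuming a uniform prior over each learned attribute."""
    return sum(math.log2(n) for n in cardinalities if n > 1)

def efficiency(gain_bits: float, questions_asked: int) -> float:
    """Information gained per question asked; zero questions means zero efficiency."""
    return gain_bits / questions_asked if questions_asked else 0.0

# Example: 6 of 10 relevant attributes learned in 5 questions.
relevant = {"genre", "era", "actors", "subtitles", "services", "recent",
            "frequency", "language", "rating", "runtime"}
learned = {"genre", "era", "recent", "services", "subtitles", "language"}
print(coverage(relevant, learned))                # 0.6 -> 60% coverage
gain = information_gain_bits([8, 4, 2, 2, 2, 2])  # 3 + 2 + 1 + 1 + 1 + 1 = 9.0 bits
print(efficiency(gain, questions_asked=5))        # 1.8 bits per question
```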

Results: Extended Thinking and Explicit Memory Both Help #

Here’s what we found across 100 scenarios:

| Rank | Agent | Task Success | Coverage | Questions | Appropriateness | Efficiency | Type |
|------|-------|--------------|----------|-----------|-----------------|------------|------|
| 1 | Claude Opus 4.5 | 78.30% | 64.67% | 5.0 | 60.80% | 19.35% | AI (Extended Thinking) |
| 2 | Claude + Memory Tools | 77.20% | 68.67% | 5.0 | 55.60% | 18.26% | AI (Memory Tools) |
| 3 | Claude Sonnet 4.5 | 75.53% | 63.00% | 5.0 | 60.20% | 19.11% | AI (Baseline) |
| 4 | GPT-5.2 + Memory | 75.17% | 65.33% | 2.1 | 87.13% | 38.32% | AI (Memory) |
| 5 | Oracle | 68.40% | 88.67% | 1.8 | 100.00% | 61.05% | Heuristic (Cheating) |
| 6 | Uncertainty Sampling | 67.40% | 84.00% | 5.0 | 69.80% | 19.74% | Heuristic |
| 7 | Popular Questions | 66.40% | 84.67% | 5.0 | 79.80% | 19.69% | Heuristic |
| 8 | Random | 59.83% | 72.00% | 5.0 | 63.80% | 17.75% | Baseline |
| 9 | Task-Driven | 51.10% | 46.33% | 4.0 | 95.73% | 25.65% | Heuristic |
| 10 | No Questions | 30.00% | 0.00% | 0.0 | 100.00% | 0.00% | Baseline |

Key Observations #

The Oracle Paradox: The Oracle agent (which has direct access to user information) achieved only 68.40% success despite knowing everything. This suggests that completing tasks successfully requires more than just having information. You also need to reason about what information is relevant and how to use it.

Two Distinct Strategies Emerge:

  • Comprehensive approach (Claude family): Ask all 5 questions, achieve high task success (75-78%), but lower appropriateness (55-60%)
  • Surgical approach (GPT-5.2): Ask only 2.1 questions on average, achieve competitive success (75.17%), with exceptional appropriateness (87.13%)

Simple Heuristics Are Competitive: Uncertainty sampling and popularity-based approaches achieved 66-67% success, nearly matching some AI-powered agents. This suggests carefully designed rules can rival general-purpose AI for certain information-gathering tasks.

Key Insights #

1. Extended Thinking Provides a Measurable Advantage #

Claude Opus 4.5 with extended thinking outperformed the baseline Claude Sonnet by 2.77 percentage points (78.30% vs 75.53%). This demonstrates that giving models computational budget to “think” about their questions before asking leads to better strategic decisions.

Extended thinking works by allocating extra tokens (up to 5,000 in our configuration) for the model to reason internally before generating its final question. This reasoning is not shown to the user but helps the model plan more carefully.

2. Explicit Memory Management Improves Coverage #

The Claude agent equipped with memory tools achieved the highest coverage (68.67%) among all agents. By giving the model explicit functions to record facts, check what it knows, and list its knowledge, we externalized memory management. This helped the model avoid asking redundant questions and build a more complete picture of the user.

Without these tools, models must rely entirely on the conversation context to remember what they’ve learned, which can lead to forgetting or confusion as conversations grow longer.

3. Efficiency vs. Thoroughness Tradeoff #

GPT-5.2 with memory demonstrated the most efficient information gathering, asking only 2.1 questions on average while achieving 75.17% success. In contrast, Claude Opus asked all 5 questions every time. This suggests different strategies:

  • GPT-5.2: Ask fewer, highly targeted questions and stop early when confident
  • Claude Opus: Use the full question budget to gather comprehensive information

Neither approach is strictly better. The choice depends on the use case. In time-sensitive applications, efficiency matters more. In high-stakes decisions, thoroughness might be worth the extra questions.
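
One way to implement the “stop early when confident” behavior is sketched below. This is our own simplification under stated assumptions; we do not know how GPT-5.2 actually decides when to stop:

```python
def should_stop(required_attributes: set[str], known: dict[str, str],
                questions_asked: int, budget: int) -> bool:
    """Stop early once everything the task needs is known, or when the budget runs out."""
    if questions_asked >= budget:
        return True
    return required_attributes.issubset(known.keys())

# Surgical strategy: a restaurant task might only require cuisine and budget.
known = {"favorite_cuisine": "thai", "budget": "under $30"}
print(should_stop({"favorite_cuisine", "budget"}, known, questions_asked=2, budget=5))  # True
```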

4. GPT-5.2 Excels at Question Appropriateness #

Appropriateness scores revealed a striking difference between models. GPT-5.2 with memory achieved 87.13% appropriateness, vastly outperforming the Claude family (55-60%). This 27-31 percentage point gap suggests GPT-5.2 is significantly better at avoiding overly sensitive, redundant, or unnecessarily broad questions.

This difference becomes even more impressive when combined with GPT-5.2’s efficiency. It asks only 2.1 questions on average, and those questions are much more likely to be appropriate. Meanwhile, Claude models ask all 5 questions but nearly half contain some form of inappropriateness (redundancy, excessive sensitivity, or unnecessary breadth).

This suggests different model training or design choices lead to fundamentally different question-asking behaviors, even when both achieve similar task success rates.

5. Rule-Based Heuristics Are Competitive #

Uncertainty sampling and popularity-based heuristics achieved 67.40% and 66.40% respectively, nearly matching some AI-powered approaches. This suggests that for certain well-defined information gathering tasks, carefully designed rules can rival general-purpose AI models.

However, AI models have the advantage of flexibility and adaptability to new scenarios without requiring manual rule updates.

What AI Devs & Researchers Can Take Away #

The Information Paradox: Retrieval vs. Reasoning #

The Oracle achieving only 68.40% despite having perfect information reveals a critical insight: the bottleneck in AI task completion isn’t always information retrieval. It’s information processing and relevance filtering. This has direct implications for RAG (Retrieval-Augmented Generation) systems and long-context applications.

Simply dumping all available information into an LLM’s context window (even if it fits) doesn’t guarantee better performance. Models need explicit reasoning about what information is relevant for the current task. This implies that retrieval quality (precision over recall) matters more than retrieval quantity, contrary to the “just throw everything in the context window” approach that’s become common with 100k+ context models.

For production systems, the takeaway is to invest in retrieval filtering and ranking mechanisms, not just larger context windows.
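
A minimal sketch of that precision-first filtering (the function name, threshold, and scores are illustrative assumptions): rerank retrieved chunks against the task and keep only the few that clear a relevance threshold, rather than stuffing everything into the context.

```python
def filter_context(chunks: list[tuple[str, float]], threshold: float = 0.75, top_k: int = 5) -> list[str]:
    """Keep only high-relevance chunks: precision over recall.
    `chunks` pairs each text chunk with a relevance score from an upstream reranker (assumed to exist)."""
    relevant = [(text, score) for text, score in chunks if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:top_k]]

# Example: only 2 of 4 retrieved chunks survive the relevance cut.
scored = [("user likes sci-fi", 0.92), ("site terms of service", 0.12),
          ("user watched Dune recently", 0.81), ("unrelated forum post", 0.40)]
print(filter_context(scored))  # ['user likes sci-fi', 'user watched Dune recently']
```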

Model-Specific Question Generation Behaviors #

The 27-31 percentage point appropriateness gap between GPT-5.2 (87.13%) and Claude (55-60%) reveals that different model families have fundamentally different question-generation behaviors, even when they achieve similar task success rates. The difference persists even though both are frontier models of broadly comparable capability.

Three hypotheses worth investigating:

  1. Training data composition: GPT-5.2 may have seen more examples of conversational appropriateness during training (customer service dialogues, interview transcripts) vs Claude’s emphasis on general helpfulness.
  2. RLHF optimization targets: The models may be optimized for different objectives. Claude’s lower appropriateness but higher task success suggests optimization for information coverage. GPT-5.2’s high appropriateness but lower question count suggests optimization for user experience and efficiency.
  3. Prompt sensitivity: The same system prompt may elicit different behaviors from different models. What reads as “ask strategic questions” to GPT-5.2 might read as “ask all possible questions” to Claude.

For AI Developers/Researchers: don’t assume prompts are model-agnostic. The same prompt can produce drastically different questioning strategies across models. Test appropriateness metrics in addition to task success when evaluating question-asking systems.

The Cost-Benefit Analysis of Extended Thinking #

Extended thinking provided a 2.77 percentage point improvement (75.53% to 78.30%) at the cost of up to 5,000 additional thinking tokens per question. With 5 questions per scenario, that’s up to 25,000 extra tokens per scenario.

At current API pricing (~$15 per million tokens for Claude Opus), this is roughly $0.375 per scenario for a ~3 percentage point improvement. Whether this is worth it depends entirely on your use case:

  • High-stakes scenarios (medical recommendations, financial planning): Absolutely worth it.
  • High-volume, low-stakes scenarios (general chat, simple recommendations): Probably not.

The marginal improvement also suggests diminishing returns. Going from no extended thinking (Claude Sonnet) to extended thinking (Claude Opus) gives you 2.77 points. But simple heuristics can get you to 67% with zero LLM costs. The decision isn’t just “extended thinking or not” but “where along the cost/performance curve do you want to operate?”
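
The arithmetic behind that per-scenario estimate, as a quick sketch (the $15-per-million-token figure is the approximate rate quoted above; actual pricing varies):

```python
THINKING_BUDGET_TOKENS = 5_000       # per question, as configured above
QUESTIONS_PER_SCENARIO = 5
PRICE_PER_MILLION_TOKENS = 15.0      # approximate Claude Opus rate quoted above (USD)

extra_tokens = THINKING_BUDGET_TOKENS * QUESTIONS_PER_SCENARIO        # 25,000 tokens
extra_cost = extra_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS      # $0.375 per scenario
print(f"{extra_tokens:,} extra tokens -> ${extra_cost:.3f} per scenario")
```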

Explicit Memory as a Hedge Against Context Window Limitations #

Claude with memory tools achieved the highest coverage (68.67%) by externalizing memory management. This contradicts the narrative that larger context windows solve all memory problems. Even with far more context than these conversations require (Claude models offer 200k-token windows), explicit memory structures improved performance.

Why? Because in-context learning creates interference. Every new fact competes for attention with every previous fact. The model must re-scan the entire conversation history to check “have I asked this before?” With explicit memory tools, this becomes a simple lookup: check_if_known("favorite_cuisine").

This has architectural implications. Rather than relying solely on retrieval from long context, consider hybrid approaches:

  • Structured memory for facts (key-value stores, databases)
  • Unstructured context for nuance and detail
  • Explicit APIs for memory operations rather than hoping the model “remembers”
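
If those memory operations are exposed as model-callable tools, the definitions might look like the following. This is a hedged sketch in the Anthropic tool-use format; the schemas and descriptions are our assumptions:

```python
memory_tools = [
    {
        "name": "record_fact",
        "description": "Store a fact learned about the user.",
        "input_schema": {
            "type": "object",
            "properties": {
                "key": {"type": "string", "description": "Attribute name, e.g. 'favorite_cuisine'"},
                "value": {"type": "string", "description": "The value the user gave"},
            },
            "required": ["key", "value"],
        },
    },
    {
        "name": "check_if_known",
        "description": "Check whether a fact about the user is already recorded.",
        "input_schema": {
            "type": "object",
            "properties": {"key": {"type": "string"}},
            "required": ["key"],
        },
    },
    {
        "name": "list_known_facts",
        "description": "List every fact recorded so far.",
        "input_schema": {"type": "object", "properties": {}},
    },
]
# Passed as `tools=memory_tools` in the messages API call; the application executes
# the matching memory function (e.g. the FactMemory sketch above) whenever the model
# emits a tool_use block.
```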

This isn’t just about avoiding redundancy. It’s about cognitive offloading. The model doesn’t waste reasoning capacity remembering facts when it could be using that capacity to reason about what to ask next.

Heuristics as a Ceiling Check #

Simple heuristics achieving 66-67% (Uncertainty Sampling and Popular Questions) while top AI models achieve 75-78% suggests two things:

  1. The problem space has structure: Well-designed rules can capture roughly 86% of the performance of frontier models (67/78 ≈ 0.86). This means the task isn’t as open-ended as it appears. There are patterns to good questioning.
  2. AI models aren’t fully utilizing their capabilities: The gap between heuristics and AI is only 8-11 percentage points. For comparison, in chess, the gap between a simple heuristic (material counting) and a modern engine is enormous. Here, it’s relatively small.

This suggests that current LLMs, despite their sophistication, are operating somewhat like “smart heuristics” for question-asking rather than demonstrating qualitatively different reasoning. The strategic planning capabilities we associate with frontier models aren’t fully expressed in this task.

Next, we plan to investigate what’s preventing models from achieving 95%+ on this benchmark. Is it the training data? The prompting? The fundamental architecture? The 78% ceiling feels artificially low for models that can write code and prove theorems.

Limitations and Future Work #

Our benchmark has several limitations to keep in mind:

Simulated Users: We tested agents against persona-based simulators, not real humans. Real users might refuse to answer certain questions, provide incomplete answers, or become frustrated with redundant questioning.

Limited Scope: 100 scenarios across 6 domains is a start, but we need more comprehensive evaluation across diverse tasks, cultures, and languages.

Automatic Evaluation: Task success is measured automatically by comparing outputs to ground truth. Human evaluation would provide richer insights into answer quality and appropriateness.

Single-Session Scenarios: Each scenario is an independent conversation. We haven’t tested multi-session memory or long-term relationship building.

Coming Soon #

We plan to expand QuestionBench with:

  • Additional Models: Gemini CLI, GLM 4.5, DeepSeek R1, and other frontier models
  • Memory-Augmented Variants: Testing different memory architectures and retrieval mechanisms
  • Human Evaluation: Having real people rate the appropriateness and helpfulness of questions
  • Multi-Session Scenarios: Testing whether agents can remember information across multiple conversations
  • Adversarial Testing: Seeing how agents handle users who provide contradictory or misleading information

The Wrap Up #

Our evaluation shows that AI models can learn to ask strategic questions, but it requires careful design. Extended thinking, explicit memory management, and thoughtful prompting all contribute to better performance. The best agent (Claude Opus 4.5 with extended thinking) achieved 78.30% task success, but there’s still a significant gap to perfect performance.

As AI systems become more autonomous and take on more complex tasks, their ability to recognize what they don’t know and ask the right questions will become increasingly important. QuestionBench provides a standardized way to measure and improve this capability.

The future of AI isn’t just about systems that answer questions well. It’s about systems that know what questions to ask.


Author: Talha Chowdhury
Date: January 4, 2026